PREFACE


On behalf of the Program Committee, conference organizers, USENIX, and SAGE, 
welcome to LISA ’02: The Sixteenth Systems Administration Conference. 


Philadelphia, birthplace of the U.S. Constitution, seems an appropriate venue for the one
conference that defines and discusses the evolving constitution of the profession of 
system administration. LISA remains a unique and essential conference created for 
system administrators by system administrators, a place where practicing system 
administrators, researchers, software developers, and vendors meet to explore the 
problems of the day, learn new solutions, and predict future research challenges and 
solutions. 


Strongly driven by social and economic forces, the constitution of our profession is 
rapidly changing. The informed professional must commit to continually learning new 
skills to keep pace in this rapidly evolving Internet culture. Every system administrator 
must now understand the fundamentals of networking and security, so LISA has been 
expanded to cover these and other timely topics that now are essential knowledge for all 
system administrators. 


Since its inception, LISA has been an ongoing testament to the problems of the day and 
challenges of the future. The unforgettable events of September 11, 2001, added some 
momentum to the already increasing importance of recovery planning, network security, 
and service monitoring. Several papers in this Proceedings ask some very interesting 
questions for the future: Is exact replay of a journal of changes the only way to reliably 
reconstruct the exact behavior of a host? Is it possible to take security too far and “break
the Internet”? Is the optimal time to apply a security patch immediately after release, or
should one wait up to two weeks? Is there a simple mathematical model that can explain 
the cost of downtime and lack of redundancy to managers? The answers to these and 
other questions are sure to create controversy. Finding the answers, though, is very 
important to the future of the profession. 


LISA continues the tradition of providing an integrated experience for attendees that is 
much more valuable than the sum of its parts. Along with you, I will be taking advantage 
of the many diverse opportunities LISA offers for professional growth: Sharpen your 
skills with tutorials on upcoming technologies. Gain insight into the nature of the 
profession in Invited Talk sessions. Evaluate new approaches to automation, monitoring, 
security, and the evolving theory of system administration (among other topics) in the 
Refereed Paper sessions. Find people with similar interests or administrative problems at 
“Birds-of-a-Feather” sessions. Bring your perplexing technical questions to experts at 
LISA’s “The Guru Is In” sessions. Explore the latest commercial innovations at the 
Vendor Exhibition. Explore research directions in the accompanying advanced 
Workshops. Explore the controversial issues of the day with peers, researchers, and the 
people behind the Internet in the various social events and the famous “Hallway Track.”


Thank you for your interest in LISA and, on behalf of all the organizers, we wish you a 
stimulating and rewarding conference experience. 


Alva Couch 
Program Chair 
Work-Augmented Laziness with the
Los Task Request System 


Thomas Stepleton — Swarthmore College Computer Society 


ABSTRACT 


Quotidian system administration is often characterized by the fulfillment of common user 
requests, especially on sites that serve a variety of needs. User creation, group management, and 
mail alias maintenance are just three examples of the many repetitive tasks that can crowd the 
sysadmin’s day. Matters worsen when users neglect to provide necessary information for the job. 
They can grow bleakest, however, at volunteer-run or otherwise loosely-coordinated sites, where 
sysadmins often collectively hope for someone else to attend to the task. 


The Los Task Request System addresses all three problems. It mitigates user vagueness with 
web forms generated from XML parameter specification files. It skirts sysadmin sloth by requiring 
one simple review and approval step to set changes into motion. It then saves time by 
automatically executing commands tailored from user input. Amidst this convenience, 
cryptographic signatures on Los directives ensure that only administrators can alter the system. 
Overall, Los aims to make life easier for users and sysadmins by standardizing and streamlining
the submission, review, and execution of requests for common system tasks.


Introduction 


For over a decade, the volunteer student system 
administrators of the Swarthmore College Computer 
Society (SCCS) have provided shell, mail, and web 
services to hundreds of College-affiliated users. How- 
ever, a problem arose during the 2001-2002 school 
year: nobody was volunteering to take care of com- 
mon system administration requests. The sysadmins 
had an excuse: most were seniors that year and were 
confronted with the double whammy of the formidable 
Swarthmore workload and figuring out what to do 
after college. Still, the requests kept piling up. 


Immediately, the SCCS chose to hire new sysad- 
mins from the freshman and sophomore classes. At the 
same time, however, an idea began to take form. 
Instead of having users mail the admins with only 
vague ideas of what they need to say to get things 
done, what if a web form could guide them in supply- 
ing the necessary information? Then, what if the 
sysadmins could just direct the data to some handy 
scripts and have everything taken care of automati- 
cally? The notion of turning the e-mail client into a 
system administration tool was compelling, and 
through an impossible feat of time management, 
development of the Los Task Request System began. 


From the outset, it became clear that Los would
have to satisfy some challenging requirements: 

1. It would have to be general enough to handle 
many different types of system administration 
tasks. 

2. It would have to reduce the time required for 
common system administration tasks beneath 
the threshold of the harried student volunteer. 

3. It would have to collect and present necessary
system configuration information to the user in
order to be user-friendly (e.g., no rote memorization
of group names).

4. It would have to be secure by design. Only 
sysadmins should be able to make changes to 
the system, and integrating new tasks into Los 
should never compromise its security. 


Happily, after months of programming, Los
appears to fulfill all of these requirements. Points 1
and 3 were handled by diligent coding of no particular
novelty; the approach to points 2 and 4, on the other
hand, is Los’s most compelling feature.


Los can be characterized as a “semi-automatic”
system administration tool. Some system administration
tools directly empower the user to make important
changes to the system. These “fully automatic” tools are
carefully written to resist malicious behavior on the part of
the user; however, since they must have elevated privileges,
there’s always a slight risk of an exploit. Los is
designed so that only the sysadmins can activate the
privileged part of the system. A review of the user’s
input, or “task request,” by a responsible human, while
relatively brief and unchallenging, is mandatory.


The best way to understand how the Los system 
works is to follow it as it handles a single task request. 
This “bird’s eye view” will reveal that there are many 
steps involved in the process. However, it is important 
to remember that users and sysadmins themselves only 
see a small and manageable fraction of them for any 
given request. 


The process starts on the Web, where the user 
makes a selection from a catalog of available auto- 
mated tasks (Figure 1). This catalog is generated from 
a collection of task description files, which are XML 
files that contain all the information Los needs to
solicit and apply task information from a user. 


Using information from the description file for 
the user’s chosen task, the Los web interface retrieves 
information from the user with a “wizard”-style series
of input forms (Figure 2). The description file can 
invoke sophisticated filters that check the validity of 
input and solicit corrections (Figure 3). 


When the user finishes entering data, Los checks
the e-mail address they specified by sending them a
verification message. The user visits a web address
from the message, and Los sends their input on to the
sysadmins. The user is finished and now waits for the
task request to be fulfilled.


A few moments later, a sysadmin sees the task
request as an XML document attached to an e-mail.
Surveying the user’s input, the admin decides it is
valid and uses a small utility to forward the data to the
Los task execution module. The utility cryptographically
signs the request with the admin’s GNU Privacy
Guard (GPG) [1] key before sending it. (In the future,
the utility will be unnecessary; the sysadmin will simply
forward a signed copy of the task request e-mail to
the execution module directly.) The sysadmin is finished
and waits for an e-mail confirming the execution
of the task.


Figure 1: The Los task list. The user begins here.


The Los task execution module, having validated
the signature on the task request, loads the appropriate
task description file and determines what commands it
needs to run. It gleans arguments to the specified
commands from the task request data, and each command
is executed. Finally, Los mails the commands’
output to the sysadmins for review.


By now, the SCCS has successfully adapted a 
number of system administration tasks to this auto- 
mated paradigm. Users are able to request new 
accounts, create and manage mail aliases and mailing 
lists, and allocate and control access to shared student 
organization webspace through six custom-made Los 
tasks. 


Figure 2: Using a “wizard”-style interface, the user enters data for the task.
Related Work


The goals of Los are not especially novel. Sev- 
eral systems that automate or at least accelerate com- 
mon system administration tasks already exist. These 
seem to fall into two categories: the fully automatic 
user-centric systems that require no sysadmin inter- 
vention, and systems which are intended to be seen 
only by the administrators. A few seem to cater to 
both, depending on their configuration. 


A good first place to look for both kinds of software
is the Internet hosting business, where the users
are owners of particular websites or other Internet
resources and the administrator must oversee the
servers that host them. Because site owners want to
provide the same services on their sites that organizations
with dedicated servers can provide, it is often
necessary for system administrators to directly configure
mail transport agents, FTP daemons, and other
systems to their needs. Some solutions to this problem
streamline the sysadmin’s job: one example software
package is ispbs [2], which provides a convenient web
interface for administration of many such sites. Several
systems also have user-centric
capabilities, however, including Account Systems
Manager [3] (ASM) and ISPMan [4], which provide
web-based configuration interfaces to the user as well.
To make system changes, all three eventually require
some sort of automated privileged mechanism: ispbs
uses a script that is automatically executed as root by
cron, ASM executes changes immediately by always
running as root, and ISPMan places user requests into
an LDAP database which is queried periodically by an
execution system.


Figure 3: Task description files can invoke filters that check the validity of user input.


We find similar software outside of the Internet
hosting realm as well. System administration tools that
take in data from sysadmins and automatically apply it
are well known and include such software as Webmin
[5] and Linuxconf [6]. Both of these systems provide
standardized, extensible means of gathering data from
the system administrator and executing the requested
changes. Linuxconf can even acquire input through
different interfaces, including a web-based interface, a
native GUI frontend, and a text console interface.
These systems also require privileges to work: like
ASM, Webmin has a dedicated webserver which runs
as root, while Linuxconf in common configurations
uses a SUID server program executed by the xinetd
Internet super server. Another system of this sort is
the Pelendur account management system [7].


Both Webmin and Linuxconf also have user-accessible
fully automatic capabilities. While Linuxconf
does so by using a built-in per-user privilege system,
Webmin employs a separate system called Usermin
[8], which also features a dedicated webserver
running as root. Other fully automatic systems include
Accountworks [9] and Mailman [10]. These systems, 
which manage user accounts on a corporate network 
and larger mailing lists respectively, are not as general 
as those described above. In the case of Mailman, its 
narrow application focus permits it to be easily iso- 
lated from the rest of the system, thus mitigating a 
great deal of the security risk involved in user-acti- 
vated system alteration. 


The components that make up Los are also not
especially new. Any user of an e-mail to FTP interface,
the Majordomo [11] mailing list system, or various
e-mail based problem tracking systems is familiar
with sending commands by e-mail. An add-on to the
RT problem tracking system [12] even checks cryptographic
signatures on e-mail directives [13]. The XML
encoding of user data bears a resemblance to existing
XML-based RPC mechanisms like SOAP [14].
Finally, the extensible web-based user input system is
similar to (but rather more limited than) configurable
web-based database frontends like FileMaker [15].


Los Components in Depth 


The bird’s eye view detailed in the introduction 
reveals three major stages in the life of a Los task 
request: creation, review, and execution. Los takes 
advantage of these divisions with a design that 
employs a separate mechanism for each stage. While
the mechanism for task request review is actually the 
judgment of a discriminating system administrator, the 
other two steps are automated by two Perl programs of 
considerable complexity: the web interface and task 
execution module mentioned previously. The web 
interface is a CGI program that resides in any web 
accessible directory that permits execution of CGI 
scripts. The task execution module usually resides in a 
library directory that contains other files necessary for 
both programs. The task description files, which con- 
tain detailed specifications for the data required for a 
task and the commands for applying it, supply both 
components with the specific information they need to 
do their job. 


Both programs are designed for version 5.005 
and greater of the Perl interpreter running on relatively 
POSIX-compliant systems. However, they also 
employ several different Perl modules from the Comprehensive
Perl Archive Network (CPAN) [16], which
may further restrict their use to the more familiar and
modern Unices. A copy of the GNU Privacy Guard
encryption software must be installed for the Los task
execution module to function. At the SCCS, Los was
developed and runs on version 2.2 of Debian 
GNU/Linux. This section will further detail the 
design, use, and implementation of these important 
Los components. 


Task Request Creation with the Web Interface 


The business of getting task request information
from the user is conducted by a single large CGI program
that creates a series of “wizard”-like web forms
for the user. While one might initially suspect that this
consists mainly of presenting questions and HTML 
form inputs to the user, the job is much more complex 
for all but the simplest of data collection tasks. Much 
of the complexity of the Los task creation script is 
designed to handle the following issues: 

1. Dynamic generation of choices. In order to
keep things simple for the users, it is often necessary
for the system to generate a list of possible
choices on the fly rather than require the user
to remember them and type them explicitly into
an input box. A good example of this is a form
that allows users to alter membership in a user
group that has write access to a web page or
other resource. It’s much easier for the group
members to identify their group name from a
listing than to spell it out on their own; later
on, furthermore, the script must be able to list
the members of the selected group for alteration.

2. Dynamic checking of input. Since no task
request is executed without a sysadmin’s
approval, it’s not absolutely necessary for user
input to be validated at every step. However,
having the system perform checks on its own
can relieve the sysadmin of having to incrementally
correct the user’s choices again and
again. It can also allow the sysadmin to focus
on more subtle errors in the input instead of
typos. The issues involved in dynamic input
checking are similar to those surrounding
dynamic choice generation.

3. State maintenance, revision, and security. 
Since the input script gathers information with 
a series of forms, it is necessary for it to pre- 
serve all the information the user has already 
supplied as it asks for more information with 
new forms. In the current system, this informa- 
tion is stored on the client side with “hidden” 
HTML form input elements. However, this 
requires tamper checking to make certain the 
user hasn’t maliciously altered any of the data 
stored on their end. In general, judicious man- 
agement of input is necessary to ensure that 
data is kept intact through multiple back and 
forth transactions between client and server. 
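
The paper does not say how the tamper check on hidden form inputs is implemented. One common way to protect client-side state is to have the server attach a keyed hash (HMAC) of the hidden fields and recompute it on every page load. The Python sketch below illustrates that idea only; the secret key, field names, and helper names are hypothetical, and this is not Los's actual Perl code.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never be sent to the client.
SECRET_KEY = b"replace-with-a-random-server-secret"

def sign_state(fields: dict) -> str:
    """Return a hex HMAC-SHA256 tag over the hidden-field names and values."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    for name in sorted(fields):
        # Length-prefix each piece so different field sets cannot collide.
        for piece in (name.encode("utf-8"), fields[name].encode("utf-8")):
            mac.update(len(piece).to_bytes(4, "big"))
            mac.update(piece)
    return mac.hexdigest()

def state_is_intact(fields: dict, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_state(fields), tag)

hidden = {"listname": "bocce-club", "EMAIL": "user@example.edu"}
tag = sign_state(hidden)                    # sent along as one more hidden input
tampered = dict(hidden, listname="root")    # what a malicious client might return
```

Under these assumptions, `state_is_intact(hidden, tag)` holds while `state_is_intact(tampered, tag)` does not, so the script can refuse to continue when the returned fields no longer match their tag.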


As mentioned earlier, the Los input script first 
greets the user with a catalog of available tasks. This 
simple task is accomplished with a cursory scan of all 
the task description files for title and summary infor- 
mation and is not especially complicated. More thor- 
ough examination of the task description files happens 
when the user selects a task and begins supplying data. 
Because the input script maintains all state on the 
client side, the following steps generally take place on 
each new page load: 

1. The script organizes information it loads from 
the current task description file. 

2. It determines which page of the wizard-style 
input forms the user has just completed. 

3. If the user is advancing to the next page, it
checks their new input against the specifications
in the task description file. If there is a
problem with the input, it prepares to show the
last page again with error messages; otherwise,
it readies the next screen. If, on the other hand,
the user chose to go back to a previous page,
the script prepares to go backwards without
checking input.

4. The script now displays the new page for the 
user, a process consisting of generating the 
HTML form elements that belong on the page 
and storing all of the data the user has entered 
so far. Data entered on other pages are cached 
in hidden HTML form inputs — the rest, if the 
user has been here before, is stored in the actual 
form elements shown on the page. The style of 
the page itself is determined by a collection of 
templates that can be configured by the admin- 
istrator. 


Each page generated by the input script is 
described by one of several parameters sections of the 
task description file. The parameters section consists 
of multiple parameter entries, which each describe a 
particular piece of information needed for the task. In 
addition to providing a short description of the param- 
eter for the user, these entries also specify the proper 
type of HTML form widget for acquiring the informa- 
tion and a list of tests that validate the user’s input. 
Some parameters, like the user’s e-mail address, are 
required for all tasks; currently, if a task description 
file omits them, the input script will automatically 
insert default stand-in parameter entries into its own 
in-memory representation of the task. 


Figure 4 contains a sample parameter entry from 
a Los task description file. The selector tag invokes a
routine in a “standard library” of HTML form elements
to generate the simple text input widget needed
for this parameter. The following format tags are 
either Perl-compatible regular expressions (PCREs) or 
library calls like the selector tag, identified with pcre 
and filter tags respectively. For specialized applica- 
tions, admins may create their own collections of 
selectors and filters if they choose. 


  <parameter name="uname" title="Preferred username">
    <description>
      Please choose a new username here. Be sure to specify one that is at
      least three letters long.
    </description>
    <selector name="Los::Selectors::Input" args="size=8,maxlength=8"/>
    <format>
      <pcre>/.../</pcre>
      <description>This username is too short</description>
    </format>
    <format>
      <filter name="Los::Filters::Tolower"/>
      <description></description>
    </format>
    <format inverse="true">
      <filter name="Los::Filters::IsUser"/>
      <description>This username is already taken</description>
    </format>
  </parameter>

Figure 4: A simplified example of a parameter entry from a Los task description file.


Selectors and filters are nothing more than Perl
routines, and relatively little effort is made to shield
them from the internals of Los. This means that they
can actually alter the data submitted by users, a desirable
feature in some cases. Because the example
parameter entry in Figure 4 is soliciting a username
from a new user, it uses a filter called
Los::Filters::Tolower to convert the input to lowercase characters.
This is necessary for the next filter, which checks
whether the username already exists on the system.
Some elaborate filters take even more advantage of
this freedom: for example, a system of password filters
for protecting access to certain Los scripts checks
and updates a password database as it processes user
input.
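
To make the order of operations in a parameter entry concrete, the following Python sketch applies the format entries of a Figure 4 style parameter to one input value. It is an illustration under stated assumptions, not Los's Perl implementation: the `FILTERS` table and `EXISTING_USERS` set are hypothetical stand-ins for the real filter routines and user database, the convention that a filter returns a pass/fail flag plus a possibly altered value is assumed, and Python's `re` module stands in for a PCRE engine.

```python
import re
import xml.etree.ElementTree as ET

ENTRY = """
<parameter name="uname" title="Preferred username">
  <format><pcre>/.../</pcre>
    <description>This username is too short</description></format>
  <format><filter name="Los::Filters::Tolower"/>
    <description></description></format>
  <format inverse="true"><filter name="Los::Filters::IsUser"/>
    <description>This username is already taken</description></format>
</parameter>
"""

EXISTING_USERS = {"root", "joe"}  # hypothetical stand-in for the user database

# Hypothetical Python stand-ins for Los filters: each returns (passed, value).
FILTERS = {
    "Los::Filters::Tolower": lambda v: (True, v.lower()),
    "Los::Filters::IsUser": lambda v: (v in EXISTING_USERS, v),
}

def check_parameter(entry_xml: str, value: str):
    """Apply each <format> in order; return (possibly altered value, errors)."""
    errors = []
    for fmt in ET.fromstring(entry_xml).findall("format"):
        pcre, flt = fmt.find("pcre"), fmt.find("filter")
        if pcre is not None:
            pattern = pcre.text.strip().strip("/")   # drop the /.../ delimiters
            passed = re.search(pattern, value) is not None
        else:
            passed, value = FILTERS[flt.get("name")](value)
        if fmt.get("inverse") == "true":
            passed = not passed                      # e.g. "must NOT be a user"
        if not passed:
            errors.append(fmt.findtext("description", default="").strip())
    return value, errors
```

Because the lowercase filter runs before the existence check, an input such as `JOE` is first rewritten to `joe` and only then rejected as already taken, which is the dependency between the two filters that the text describes.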

Many arguments to selectors and filters can be 
interpolated, thereby incorporating user input into 
their operation. This is the basis of the dynamic gener- 
ation of choice mentioned earlier. Below, an example 
selector tag generates a textarea for editing group 
membership: 

<selector
name="Los::Selectors::GroupTextarea"
args="rows=10,cols=10,wrap=off,
group=~pagename~"/>


The name of the group to be edited is stored in 
the variable pagename, which is named between two 
tilde characters in the group argument to this selector. 
Presumably pagename was supplied by the user in an 
earlier form page; in the script this example was taken 
from, the user chooses pagename from a menu of stu- 
dent organization webpages. 
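
The interpolation step can be sketched in a few lines of Python. This is an assumed reading of the mechanism, not the Los source: the function name `resolve_args` is hypothetical, and the sketch splits the argument string on commas before substituting so that a user-supplied value cannot introduce extra arguments.

```python
import re

TILDE_VAR = re.compile(r"~(\w+)~")  # a variable name between two tildes

def resolve_args(args: str, values: dict) -> dict:
    """Split 'key=value,...' and replace each ~name~ with earlier user input.

    Raises KeyError if a referenced variable has not been supplied yet.
    """
    resolved = {}
    for item in args.split(","):
        key, _, value = item.strip().partition("=")
        resolved[key] = TILDE_VAR.sub(lambda m: values[m.group(1)], value)
    return resolved

raw = "rows=10,cols=10,wrap=off,group=~pagename~"
resolved = resolve_args(raw, {"pagename": "bocce-club"})
```

Here `resolved["group"]` becomes the page name chosen on the earlier form, while the literal arguments pass through unchanged.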


Eventually the script runs out of form pages to
show the user; at this point the user has entered all the
information mandated by the task description file. The
script does one final check of all the user input by
checking every format item of every parameter. If all
is well, the script creates a file storing the information
as a formal Los task request, giving it a unique ID in
the process. It e-mails a confirmation URL containing
the ID to the user and keeps the file on hand until the
URL is visited. Once this occurs, the task request is
sent at last to the system administrators for review and
approval.
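
As a rough illustration of this final step, the Python sketch below serializes collected parameters into a task request document with a unique ID. The element and attribute names follow the sample request shown in the paper, but everything else is an assumption: the ID format (a timestamp joined to a random number) is only inferred from the look of the sample ID, and the helper names are hypothetical.

```python
import secrets
import time
import xml.etree.ElementTree as ET

def new_request_id() -> str:
    """Timestamp plus a random number (an assumed format, not confirmed)."""
    return f"{int(time.time())}-{secrets.randbelow(10**9)}"

def build_request(taskinfo: dict, params: dict, request_id: str) -> str:
    """Serialize the collected input as a los_transaction document."""
    root = ET.Element("los_transaction", id=request_id)
    ET.SubElement(root, "taskinfo", taskinfo)
    plist = ET.SubElement(root, "parameters")
    for name, value in params.items():
        ET.SubElement(plist, "parameter", name=name).text = value
    return ET.tostring(root, encoding="unicode")

request_xml = build_request(
    {"title": "Create a SMALL mailing list", "version": "1.0"},
    {"listname": "bocce-club", "EMAIL": "user@example.edu"},
    new_request_id(),
)
```

The same ID would then be embedded in the confirmation URL, so that visiting the URL identifies which stored request to release to the sysadmins.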


A great deal of effort has gone into making the 
Los input script flexible enough to handle many differ- 
ent types of information gathering applications. In 
recent months, the standard selector and filter libraries 
have grown in their abilities to draw information from 
files, user and group databases, and other sources 
required by the Los tasks designed for the SCCS. Nev- 
ertheless, it is certain that there will be applications at 
other sites where these routines will be inadequate. 
Even the input script itself is limited to proceeding lin- 
early through the lists of parameters in the task 
description files — although it can modify its questions 
based on prior user input, it cannot adopt radically dif- 
ferent branches of questioning on any basis. Thank- 
fully, because Los’s modular design permits other 
interfaces to create task requests to do what the web 
interface does, tasks with complex data gathering 
needs can be handled by custom applications and still 
work with the rest of the Los system. 


Task Request Execution by Mail 


Figure 5 shows a complete Los task request,
which the system administrators receive as an e-mail
attachment from the input script. If the request is satisfactory,
the sysadmin executes the task request by
signing it with their GPG key and sending it on to the
task execution script over e-mail. This relatively simple
gesture belies the considerable complexity
involved in making certain that the task request is
trustworthy and avoiding the perils that come with an
e-mail activated system with elevated privileges.


  <?xml version='1.0' encoding='utf-8'?>
  <!DOCTYPE los_transaction SYSTEM "los_transaction.dtd">
  <los_transaction date="Sat, 06 Jul 2002 17:54:49 EST"
                   ip="24.205.87.222"
                   id="1025996089-489825418">
    <taskinfo title="Create a BIG mailing list"
              version="1.0"
              creator="tss"
              taskfile="maillist_big_new.xml"
              md5sum="d64d1c85c8c7f3700826d4eda32d512f" />
    <parameters>
      <parameter name="EMAIL">tss@sccs.swarthmore.edu</parameter>
      <parameter name="password_check">1a@S3dSF</parameter>
      <parameter name="pledge">null</parameter>
      <parameter name="password">1a@S3dSF</parameter>
      <parameter name="FULLNAME">Tom Stepleton</parameter>
      <parameter name="DESCRIPTION">null</parameter>
      <parameter name="listname">test-list</parameter>
    </parameters>
  </los_transaction>

Figure 5: A sample Los task request; this one requests a new mailing list named test-list.


Because the formatting of e-mail messages differs
between mailers and because the Los executor is
designed to diminish security risks by analyzing all
aspects of an input e-mail, a special utility program is
currently required to generate e-mails for the executor.
The sysadmin pipes the task request into the utility,
which encodes it in Base64 to preserve its formatting,
solicits the admin’s GPG passphrase, signs the data with
the admin’s GPG key, and sends it on to the executor.
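
The reason for the Base64 step can be shown with a short Python sketch. Only the encoding round trip is illustrated; the passphrase prompt, the signing, and the mailing are left out, and the function names are hypothetical rather than taken from the Los utility.

```python
import base64

def encode_request(xml_text: str) -> str:
    """Encode the request so mailers cannot rewrap or reflow its whitespace."""
    return base64.b64encode(xml_text.encode("utf-8")).decode("ascii")

def decode_request(payload: str) -> str:
    """Decode strictly: any character outside the Base64 alphabet is an error."""
    return base64.b64decode(payload, validate=True).decode("utf-8")

# Tabs, trailing spaces, and line endings are exactly what a mailer may alter.
original = "<parameters>\n\t<parameter name=\"listname\">test-list</parameter>  \n</parameters>\n"
payload = encode_request(original)
```

Because the signature is computed over exact bytes, any reformatting in transit would otherwise invalidate it; decoding `payload` on the receiving side yields `original` byte for byte.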


The Los executor is currently a SUID root Perl 
script. This should rouse concern in cautious system 
administrators, as SUID scripts are widely reputed to be 
dangerous [17]. However, the Perl interpreter on Unix 
systems has a special facility for safer execution of 
SUID scripts: suidperl, which, by automatically impos- 
ing a technique known as taint checking, requires the 
programmer to properly shield the actions of the SUID 
script from externally controlled influences like input 
and environment [18]. There are further security pre- 
cautions taken by suidperl to thwart the subversion 
methods commonly directed at SUID scripts, and mod- 
ern Unix systems often have mechanisms that prevent 
the race condition attacks that made all SUID scripts 
unsafe in past years. Still, some admins may be unwill- 
ing to adopt the extra risk that accompanies a new pro- 
gram with root privileges and may conclude that Los is 
not appropriate for their sites. 


When a signed task request is sent to the Los 
executor by mail, it is actually sent to a dedicated user 
whose e-mail is redirected by Procmail [19] or an 
analogous mechanism into the Los execution script. 
The script itself is owned by root and is otherwise 
exclusively executable by members of a dedicated 
group, of which the Los executor user is the only 
member. Security conscious sysadmins may elect to 
impose additional constraints of their own on the 
mechanism that directs e-mail into the Los execution 
script, such as an independent verification of the cryp- 
tographic signature on the task request or a blanket 
rejection of all messages except those from a few 
select hosts. Thus, though e-mail may seem like a 
particularly unprotected mode of transit for the task 
requests, judicious application of common e-mail 
utilities like Procmail makes it possible to carefully 
inspect and filter them first. 


Once underway, the Los executor first checks the 
GPG signature on the message. To do this, as it does 
with any external program call that doesn’t require 
root privileges, the executor spawns a subprocess that 
adopts the UID of the dedicated Los user, performs the 
work itself, and reports back to the parent through 
UNIX pipes. For this particular step, a double function 
is served, as the Los user also owns the GPG keyring 
against which the task request signature is validated. 
For the signature to be approved, the signer’s public 
key must be a trusted key (i.e., it must be locally 
signed with the Los user’s key, requiring the admin to 
su to the Los user and import and sign their key 
manually) and must be explicitly mentioned in a file 
containing a list of keys whose owners have authorization 
to approve task requests. Unless all of these conditions 
are met, the Los executor will abort and report the fail- 
ure to the sysadmins. 
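

The approval conditions above can be sketched as a small decision function. This is a minimal illustration rather than the Los implementation: the authorization file format (one key ID per line, with `#` comments) and the function names are assumptions, and the signature and trust results are taken as inputs that would come from GPG run under the Los user's UID.

```python
def load_authorized_keys(lines):
    """Parse a hypothetical authorization list: one key ID per line, '#' starts a comment."""
    keys = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            keys.add(entry.upper())
    return keys


def may_approve(signature_valid, key_trusted, key_id, authorized_keys):
    """Return True only if the signature verifies, the key is trusted, and it is listed."""
    return bool(signature_valid and key_trusted
                and key_id.upper() in authorized_keys)
```

Any single failing condition makes the function return False, which corresponds to the executor aborting and reporting the failure.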


The script moves on to carefully decode and 
parse the Base64 encoded task request, maintaining a 
healthy paranoia about unexpected input. This done, 
the script examines the taskinfo tag from the task 
request to determine which task description file was 
used to generate it. The title, version, and creator 
attributes must correspond exactly with the version 
and creator information specified in the task descrip- 
tion file, otherwise the script assumes that two differ- 
ent versions of the file are in use (a situation that 
might arise if the input script and executor are on dif- 
ferent hosts) and aborts. Optionally, the executor can 
compare the MD5 checksum of its copy of the task 
file with that of the one used by the input script. This 
is not enabled by default, however, as some admins 
may choose to have different task description files for 
task request creation and execution. 
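

A minimal sketch of this consistency check follows. The XML layout is an assumption made for illustration (a `taskinfo` element that is a direct child of the root and carries `title`, `version`, and `creator` attributes); the actual Los file format may differ.

```python
import hashlib
import xml.etree.ElementTree as ET


def taskinfo_matches(request_xml, description_xml):
    """Compare title, version, and creator of the two taskinfo elements."""
    req = ET.fromstring(request_xml).find("taskinfo")
    desc = ET.fromstring(description_xml).find("taskinfo")
    if req is None or desc is None:
        return False
    return all(req.get(attr) is not None and req.get(attr) == desc.get(attr)
               for attr in ("title", "version", "creator"))


def md5_matches(local_description_bytes, expected_hex):
    """Optional stricter check: compare MD5 checksums of the task description file."""
    digest = hashlib.md5(local_description_bytes).hexdigest()
    return digest == expected_hex.lower()
```

An executor following the text would abort when `taskinfo_matches` returns False, and would call `md5_matches` only if the stricter check has been enabled.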


With all its suspicions allayed, the Los executor 
can turn its attentions at last toward executing the task. 
There is still one contingency to anticipate, however. 
It is possible that the same task request might be sent 
to the executor twice by two sysadmins acting inde- 
pendently. For certain tasks, this could be harmful to 
the system. The executor prevents this by attempting 
to deposit the contents of the task request into a file 
named with the task request’s ID in a designated log 
directory. If the file already exists or if the script is 
unable to get an exclusive lock on an empty file, the 
executor presumes that the task has already been exe- 
cuted and aborts. This protective feature also doubles 
as a convenient logging mechanism. 
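

The duplicate check can be illustrated with a short sketch. Note that it swaps in a slightly different technique from the one described: instead of taking an exclusive lock on an empty file, it relies on atomic exclusive file creation (`O_EXCL`). The function name and the rule for acceptable task IDs are assumptions.

```python
import os


def claim_task(log_dir, task_id, request_text):
    """Record the request under its ID; return False if it was already claimed."""
    # Hypothetical ID rule: refuse anything that could escape log_dir.
    if not task_id or not all(c.isalnum() or c in "-_" for c in task_id):
        raise ValueError("suspicious task ID: %r" % task_id)
    path = os.path.join(log_dir, task_id)
    try:
        # O_EXCL makes creation fail atomically if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(request_text)
    return True
```

Because the file keeps the request contents, the same directory serves as the log of executed tasks.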


The Los executor finally spawns a subprocess to 
execute the task. The commands for task execution 
appear in a commands block at the end of the task 
description file, which also specifies the username 
under which the commands should be run. Immedi- 
ately the subprocess drops as many privileges as it can 
and executes each command one by one. The Los 
executor has a relatively flexible means of interpolat- 
ing variable names in command arguments. However, 
it is not as flexible as the shell and is not intended to 
be. Rather than rely on a sophisticated command line 
interpreter built into the script, it is expected that 
administrators will simply pass variables into shell 
scripts that do most of the work themselves. 
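

As an illustration of interpolation that is deliberately less flexible than the shell, the sketch below substitutes variables into each argument separately and never builds a shell command line. The `$name` placeholder syntax and the script path are assumptions; the text does not specify the syntax Los uses, and privilege dropping is not shown.

```python
from string import Template


def interpolate_command(argv_template, variables):
    """Substitute $name placeholders in each argument separately.

    A missing variable raises KeyError instead of expanding to nothing.
    """
    return [Template(arg).substitute(variables) for arg in argv_template]
```

Since the result is an argument list, a value containing spaces or shell metacharacters stays a single argument when the list is handed to an exec-style call such as `subprocess.run(argv)`.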


Once the subprocess is finished, its output and 
the output of all the commands it invoked are sent 
back to the sysadmins. The well-traveled and highly- 
automated life of the task request is over, and the user 
is (hopefully) satisfied. 


Los In Use at the SCCS 


The Swarthmore College Computer Society has 
prepared a number of tasks for automation by Los. 
These tasks represent a certain critical intersection 
between those that are most frequently requested by 
our users and those that are the most bothersome to 
take care of. These include the creation of new users 
and mailing lists, the management of mail aliases, and 
the creation and management of student organization 
webpages. These are interesting problems as they 
require both the input and execution sides of Los to 
involve themselves deeply in the analysis and modifi- 
cation of different aspects of system configuration. 


The Los task description files at the SCCS have 
been written to thoroughly screen user input for 
correctness. In cases like mail alias creation, this requires 
the input script to check that the new alias name 
isn’t already being used by users, existing mail aliases, 
or Mailman mailing lists on our system. Organization 
web page management requires the culling of group 
membership information as well as password-restricted 
access. The standard selector and filter 
libraries handle these jobs capably, though as new 
needs for Los arise, there will doubtless be a need for 
more library functionality. 


For most of the SCCS tasks, the Los executor 
simply invokes an external script with the data col- 
lected from the user. Often these scripts must modify 
configuration files like the mail aliases database, a 
task which requires care even when performed by a 
human administrator. To make this modification task 
simpler, Los comes with a utility that allows the 
scripts to perform the modifications in a “record-oriented” 
manner: within a section of the file delineated 
by special comments, the utility adds, alters, and 
removes comment-delimited records of text provided 
by the scripts. 
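

To make the record-oriented idea concrete, here is a minimal sketch of such an editor. The marker comments (`# BEGIN LOS RECORDS`, `# RECORD name`, and so on) and the function names are hypothetical; the utility shipped with Los may use different delimiters and a different interface.

```python
BEGIN, END = "# BEGIN LOS RECORDS", "# END LOS RECORDS"


def _split(lines):
    """Return (before, section, after); fail unless one well-formed section exists."""
    if lines.count(BEGIN) != 1 or lines.count(END) != 1:
        raise ValueError("record section is missing or duplicated")
    start, stop = lines.index(BEGIN), lines.index(END)
    if start > stop:
        raise ValueError("record section markers are out of order")
    return lines[:start + 1], lines[start + 1:stop], lines[stop:]


def _parse(section):
    """Turn the section into an ordered {name: [body lines]} mapping."""
    records, name = {}, None
    for line in section:
        if line.startswith("# RECORD "):
            if name is not None:
                raise ValueError("nested record")
            name = line[len("# RECORD "):]
            if name in records:
                raise ValueError("duplicate record")
            records[name] = []
        elif line == "# END RECORD":
            if name is None:
                raise ValueError("stray record terminator")
            name = None
        elif name is None:
            raise ValueError("text outside any record")
        else:
            records[name].append(line)
    if name is not None:
        raise ValueError("unterminated record")
    return records


def edit_records(text, name, body=None):
    """Add or replace record `name` with `body` lines, or remove it if body is None."""
    before, section, after = _split(text.splitlines())
    records = _parse(section)
    if body is None:
        records.pop(name, None)
    else:
        records[name] = list(body)
    rebuilt = []
    for key, lines in records.items():
        rebuilt += ["# RECORD " + key, *lines, "# END RECORD"]
    return "\n".join(before + rebuilt + after) + "\n"
```

Text outside the delimited section is passed through untouched, and any malformed delimiter raises an error before anything is rewritten.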


This utility exhibits a high degree of caution and 
will fail if the record section and record delimiters are 
not all well-formed. Similarly, whenever possible, the 
scripts employ the system file modification tools sup- 
plied with our Linux distribution, including useradd and 
gpasswd for modification of the user and group 
databases respectively. The SCCS believes that stan- 
dardized tools are the key to safe automated modifica- 
tion of important system configuration files, and we 
abide by this in our executor scripts as often as possible. 


Creating Los task description files really is pro- 
gramming, and the time it takes depends on how thor- 
oughly the sysadmin wishes to check user input and 
how difficult it is to safely automate the execution of 
task requests. The selector and filter lines in parameter 
entries are comparable to function calls and tend to 
each take several different arguments. It would not be 
difficult to make an interface for task description file 
creation that uses a web browser and dialog boxes to 
simplify the task; indeed, this might greatly speed the 
process, as much of the development effort goes into 
manually creating the XML and remembering the 
arguments to selectors and filters. 


For the moment, setting up Los for a new task 
can take some time, on the order of several hours for 
us at the SCCS. Furthermore, if the task requires a 
custom selector or filter, some rather involved Perl 
programming may be required, as the interface the Los 
input script uses to invoke the selectors and filters is 
complex. This may change in the future, but current 
efforts are focusing on making the standard libraries 
more versatile and complete. 


For the time being, expeditious programming of 
Los task description files requires planning before- 
hand. The admin can work backwards, starting with 
figuring out how tasks can be executed automatically 
and then determining exactly how to obtain the infor- 
mation needed for task execution through Los. Once 
choices about parameter inputs have been made, the 
admin can start thinking about what checks they wish 
to apply to the user’s input. At last, with all of these 
things established, the admin can code up the task 
description file parameter entry by parameter entry. 
Thankfully, once the task description file has been 
completed, installing it into the catalog of Los tasks is 
as simple as dropping it into the same directory as all 
the other Los tasks. The task appears in the listing, 
ready to use, the next time the main Los catalog page 
is loaded. 


Los was completed at the SCCS in the final 
months of the 2001-2002 school year. As such, most 
students at Swarthmore were too busy to request the 
tasks Los has been configured to handle. After the end 
of the semester and up to the time of writing (mid 
summer), requests have been understandably sporadic. 
Los has indeed capably handled these requests and has 
dramatically improved sysadmin response time, usually 
finishing within a couple of hours of submission 
business that could otherwise take up to a week, 
depending on the demands of our courses or the 
distractions of summer. 


However, it has yet to face the normal SCCS 
request workload or the heavy period that comes at the 
beginning of the school year. The SCCS fully expects 
Los to greatly improve our service to the college com- 
munity under these stresses, and by the time of the 
2002 LISA conference we intend to quantitatively 
demonstrate this improvement. 


One thing we are capable of measuring now is 
the performance of the Los system on the SCCS 
servers. Currently we’re running both the Los input 
script and the executor on our main login server, a 400 
MHz Pentium II-based machine with 380 MB of 
memory. Because Los is frequently opening files, gen- 
erating NIS or LDAP queries, or doing whatever it 
needs to do to get the information for its selectors 
and filters, it is not a champion of speed. 


The Los task catalog on the SCCS takes just 
under four seconds to load on the Swarthmore net- 
work, with the time mostly occupied by the superficial 
scan of the six different task description files we use. 
For some of the more complicated tasks, it can take 
about the same time to proceed from one wizard 
screen to the next. Even when the input script is just 
creating HTML without doing any input checking or 
complicated widget generation, the overhead of load- 
ing and parsing the task description file and otherwise 
getting things ready can take about a second. 


Naturally, the speed of the execution script 
depends on the particulars of the task being executed. 
Though the SCCS task description files exhibit con- 
siderable complexity when it comes to checking the 
user’s input, the fact is that once the input is collected, 
there isn’t much work to do for our tasks. Typically a 
few files will be modified and group membership will 
be altered, and then the task is finished. Our non-sci- 
entific gauge of how long a task request takes to exe- 
cute, which involves approving a task and then enthu- 
siastically mashing the TAB key in the Pine mailer’s 
message index, indicates that most of our tasks take 
between five and ten seconds to be executed. 


Future Directions in Los 


After months of development, Los has grown 
into a system that meets all the goals the SCCS set out 
for it. It provides a straightforward interface for com- 
mon system tasks, eliminating the usual e-mail dialog 
needed to determine exactly what the user wants. It 
provides an antidote to the “someone else will do it” 
syndrome of the busy volunteer sysadmin by drasti- 
cally reducing the time it takes to attend to these tasks. 
However, Los is surely not right yet for everyone. 
This section lists some possible improvements to Los 
or similar semi-automatic system administration sys- 
tems. 


More thorough task delegation. For the SCCS, 
Los abbreviates the time it takes to attend to system 
tasks enough that it’s not necessary to formally assign 
task requests to system administrators to ensure that 
someone attends to them. Indeed, this is probably not a 
good strategy for us, as it’s hard to predict when a par- 
ticular admin has time to attend to the system rather 
than coursework. However, it might make sense at some 
sites for task requests to be automatically delegated to 
members of the administration team. Modifying Los to 
send task requests to particular administrators would be 
fairly easy — making a system that carefully manages 
who was assigned what would be harder. 


Accountability through cryptography. At the 
moment, Los doesn’t record who authorized the exe- 
cution of a task request. Since a cryptographic signa- 
ture is required for this to happen, a great deal more 
could be done to indicate incontrovertibly who autho- 
rized the execution of a task. This also would demand 
relatively little modification of Los, though it does 
demand a secure, external means of logging task 
requests and signatures. 


Use of XML based RPC standards. As hinted 
in the references section, a Los task request is little 
more than a remote procedure call. Los happens to use 
XML for task requests and task description files 
mostly due to the great amount of support for XML in 
Perl and elsewhere, and thus relatively little attention 
was given to modeling Los’s data transaction formats 
after established XML-based standards. However, it 
may be more beneficial from an integration and versa- 
tility standpoint to use one of the standard XML-based 
RPC message formats such as SOAP [14] or XML- 
RPC [20]. Los task requests could then conceivably be 
used with other systems besides the Los executor. 


Integration with Linuxconf. As mentioned pre- 
viously, Linuxconf is a powerful collection of tools 
designed to automate and provide a straightforward 
interface for common system administration tasks. 
Unlike Los, however, Linuxconf is designed for the 
system administrator and focuses on the kinds of sys- 
tem parameters the user shouldn’t necessarily have to 
deal with (firewall setup, printer configuration, etc.). 


At sites with a lot of personal Linux worksta- 
tions, however, users may legitimately wish to alter 
these system parameters of desktop machines while 
administrators might prefer not to give them root 
access. One solution might be to use Los to collect 
configuration requests from the user and then to 
invoke the powerful Linuxconf modules with the Los 
executor to make the changes. 


Linuxconf already does much of what Los does 
with respect to collecting and applying data, so it may 
instead make sense to adapt Linuxconf to the semi- 
automatic approach to task request approval. 


Easier review of task requests. Right now the 
review of Los task requests requires the sysadmin to 
visually parse XML to determine whether the user’s 
input is appropriate. This is not especially difficult, but 
it could be streamlined by a program that interpreted 
Los task requests, combined them with the informa- 
tion in their corresponding task description files, and 
generated more legible representations of the user’s 
data. Standardized technologies like XSLT [21] could 
make this a rather straightforward task. 


Easier authorization of task requests. One 
extremely desirable improvement to Los is the elimi- 
nation of the clumsy approval script. It would be bet- 
ter if the executor were capable of taking signed e- 
mail forwards from any mail client, determining 
whether the signature was valid, and executing the 
task request. This is difficult, however, as it requires 
careful analysis of the e-mail, and of course the conse- 
quences of a misjudgment could be dire. Further com- 
plicating matters, some mailers have different behav- 
iors when it comes to signing e-mails with attach- 
ments, let alone signing forwards of e-mails with 
attachments. Still, the benefits of being able to simply 
forward a task request to the Los executor make this 
an eminently worthwhile goal. 


Novel input methods. The web interface to Los 
is a fairly satisfactory means of acquiring input from 
the user. Pains have been taken to make the default 
template for Los’s HTML output attractive in Lynx and 
other text-based browsers. Still, it might be useful to be 
able to submit Los task requests from handheld com- 
puters, kiosks, embedded specialty systems, or other 
devices where web browsers are not practical. Unfortu- 
nately, this may require a restructuring of Los, as the 
selector and filter lines in the task request files are tied 
fairly exclusively to the Web-only standard libraries. 


Integration with problem tracking systems. 
Though not necessary for the SCCS, some sites might 
benefit from managing Los task requests with a prob- 
lem-tracking system. Users could check on the status of 
their task requests, and sysadmins could tell at a glance 
which tasks were awaiting attention. A history of exe- 
cuted tasks would also be available for later perusal. 


Limited experimentation on integrating Los with 
problem tracking software has already taken place. 
Because Los uses e-mail as its transaction transport 
mechanism, the e-mail based GNATS system [22] was 
an easy choice. It was not difficult to alter the CGI 
input script and the task request approval script to cre- 
ate and interpret GNATS problem report e-mails. 
However, there are opportunities for tighter integra- 
tion. Just as you can now edit a problem report by 
specifying its category and ID number on the command 
line (as in edit-pr mycategory 532), a sysadmin 
should be able to approve a task request in the same 
fashion with a script that automatically updates the 
status of the problem report that contains it. This 
should not be a challenging task. 


One issue that remains to be resolved is the 
encoding of task requests in GNATS problem reports. 
When the input script sends a task request to the 
sysadmins, it places it in a Base64 encoded MIME 
attachment to avoid corruption of the data. Most Los 
applications don’t need this precaution, but the risks of 
quoted-printable mail encoding, CR to CRLF conver- 
sion, and other e-mail mutations make it a prudent 
one. A default GNATS installation does not deal well 
with MIME-encoded e-mails, though this issue is 
being addressed. For the time being, though, another 
encapsulation mechanism may be necessary. 
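

The encapsulation described above can be illustrated with Python's standard `email` package. This is only a sketch of the general technique: the addresses, subject line, and attachment filename are placeholders, and it says nothing about how the Los input script or GNATS actually build and read their messages.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def wrap_task_request(task_xml, sender, recipient):
    """Build a message carrying the task request bytes as a Base64 attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Los task request"
    msg.set_content("A Los task request is attached.")
    # Byte payloads are Base64 encoded by default, so line endings survive.
    msg.add_attachment(task_xml, maintype="application",
                       subtype="octet-stream", filename="task-request.xml")
    return msg.as_bytes()


def unwrap_task_request(raw_message):
    """Return the decoded bytes of the first attachment."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.iter_attachments():
        return part.get_content()
    raise ValueError("no attachment found")
```

Round-tripping a request that mixes CR, LF, and tab characters through these two functions returns the original bytes unchanged, which is the property the Base64 attachment is meant to guarantee.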


Use on multiple systems. At the moment, thanks 
to the stopgap task approval script, Los task requests 
can only be directed toward a single Los executor, and 
thus a single computer system. Certainly the script 
called by the executor could use a tool like Igor [23] to 
trigger changes on many systems at once. What about 
a situation where different kinds of task request must 
be executed on different machines? 


When the task approval script is eliminated, the 
sysadmin will be able to simply direct task requests 
toward the proper computer by altering the To: field in 
their e-mail client. However, because executing a task 
request on the wrong system could have negative con- 
sequences, it might be appropriate to also place extra 
information in Los task requests that prevents them 
from being executed on the wrong system. This would 
not be an especially difficult modification. 


Getting Los 


In order to encourage widespread use and enthusi- 
astic development, Los has been released under the 
most recent version of the BSD license. Los can be 
downloaded from the Free Software Foundation’s 
Savannah development repository at 
http://savannah.gnu.org/projects/los/. 


Spam Blocking with a Dynamically 
Updated Firewall Ruleset 


ABSTRACT 


In this paper, we detail our methods for controlling spam at a small ISP, reducing both 
resource usage and customer complaints. We will discuss our initial unsuccessful tactics, and the 
resulting development of our unique spam blocking system. Deny-Spammers classifies hosts as 
probable spammers and inserts those hosts into a dynamically updated firewall ruleset on our mail 
server, thereby effectively blocking the host from making an SMTP connection to our mail server. 
Our analysis demonstrates that this has been effective in reducing the amount of spam that our 
customers receive, and the burden on our limited resources. 


Introduction 


Are you aggravated by spammers launching 
what are effectively Denial of Service Attacks against 
your mail server? We were, and after several attempts 
at using some established spam-control techniques, we 
recognized the need to create our own novel approach, 
which we affectionately call “Deny-Spammers.” 


With the abundance of spam on the Internet 
today, nearly every ISP finds themselves forced to 
participate in some kind of spam¹ blocking. 


Jon Postel wrote in RFC 706 [1], “there is no 
mechanism for the [email] Host to selectively refuse 
messages. This means that a Host which desires to 
receive some particular messages must read all mes- 
sages addressed to it. Such a Host could be sent many 
messages by a malfunctioning Host. This would con- 
stitute a denial of service to the normal users of this 
Host. Both the local users and the network communi- 
cation could suffer.” 


While this scenario is common enough today, it 
was a shocking thought in 1975 when Postel authored 
RFC 706. Then, the Internet was still a place of 
co-operation where users operated with the “greater good 
of the net” in mind. Today’s mail servers operate 
under an almost constant threat of Spam Denial of 
Service attacks. 


In an article in The New York Times from June 27, 
2002, Jennifer Lee writes, “Brightmail, which maintains 
a network of In boxes to attract spam, now records 
140,000 spam attacks a day, each potentially involving 
thousands of messages, if not millions.” [2] A similarly 
bleak report from Hotmail states that 80% of its almost 
two billion processed email messages are spam [3]. 


Of particular interest to us as an ISP is the reaction 
of the customer base to spam in their inbox. A 
report by Gartner Consulting states that 53% of its 
respondents place the blame for spam on their ISP. 
They found that UCE (Unsolicited Commercial 
Email) ranks fourth in reasons for customer churn [4]. 


¹ “Spam” is the term commonly used to refer to mass- 
emailed, unsolicited commercial email (also known as 
UCE), sent by a person or organization usually referred to as 
a “spammer.” 


For these reasons, having a spam control policy 
is no longer an option for an ISP, no matter what its 
size. Hotmail subscribes to MAPS (Mail Abuse Pre- 
vention System) RBL (Realtime Blackhole List) [5]. 
Both AOL and Earthlink advertise their spam filtering 
as a benefit to their services. 


Telerama is a small ISP, established in Pitts- 
burgh, PA in 1991. Our mail system consists of a sin- 
gle server for both incoming and outgoing mail. It 
uses a 1 GHz Athlon processor and has 640 MB of 
RAM, running FreeBSD 2.2.8-STABLE. Our mail 
transport agent is qmail-1.03 [6]. In addition to the 
stock qmail distribution, we are using the qmail-uce 
checklocal patch [7] to reject mail for non-existent 
mailboxes. 


This server handles all incoming and outgoing 
mail for approximately 7,000 accounts, including well 
over 600 hosted virtual domains, and their associated 
addresses. The server typically delivers between 
50,000 and 70,000 incoming e-mail messages to local 
users in a 24-hour period. Attempted deliveries, 
including messages to non-existent mailboxes, vary 
between 100,000 and 140,000 messages in the same 
time period. We estimate that 50% of the 
attempted deliveries are blocked by the qmail-uce 
checklocal patch alone. 


From the user’s perspective, spammers cause a 
general slowing down of our entire mail system. At 
times, a single spammer would open hundreds of simul- 
taneous SMTP connections to our mail server, dumping 
thousands of messages into our mail queue. This causes 
delays in message delivery lasting from minutes to 
hours. A user sending mail through our system could 
wait up to 30 seconds before a 220 response code [8] is 
returned by the SMTP server. As most spam is generated 
during business hours, the mail queue would 
shrink back to a manageable size in the evening. The 
next day, the problem would repeat itself. When the snappy 
performance we are used to diminishes, our users start 
to complain. We 
wanted to be able to identify spammers and simultaneously 
block them in order to prevent degradation of the 
performance of the mail server. 


What We Tried First 


Two alternative approaches we attempted before 
developing Deny-Spammers were: 

• using qmail-uce’s checklocal patch to deny 
  mail for non-existent mailboxes 
• using ucspi-tcp’s rblsmtpd [9] in conjunction 
  with several RBL sources 


First, we attempted to implement the qmail-uce 
checklocal patch alone. This patch makes qmail reject 
mail for non-existent mailboxes. qmail, by default, 
accepts mail for non-existent addresses. It determines 
later whether or not the user exists. 


Unfortunately, this did not prevent spammers 
from getting connected to the mail server and getting 
messages into the mail queue. Many spammers make 
several parallel SMTP connections, so it was not 
uncommon to see 50 or more SMTP connections from 
a single IP address. Although the checklocal patch 
prevented messages to non-existent addresses from 
entering our mail queue, spammers essentially created 
a Denial of Service to the users of our mail server. 


We immediately realized that the qmail-uce 
checklocal patch would not solve our problem on its 
own. Our next approach involved using rblsmtpd, 
which is part of the ucspi-tcp [10] package that 
includes tcpserver [11]. rblsmtpd attempts to block 
mail from RBL-listed sites by querying one or more 
RBL sources. rblsmtpd works with any smtpd server 
that runs under tcpserver. It can be configured to 
respond with a permanent (553) failure error message 
or a temporary (451) failure error message. 
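

The text does not spell out what a query to an RBL source looks like. Under the usual DNS blocklist convention, the octets of the client's IPv4 address are reversed and prepended to the list's zone, and the address counts as listed if that name resolves. The sketch below only builds the query name; the zone `rbl.example.org` used with it is a placeholder, not a real list, and the function name is an assumption.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Return the DNS name a blocklist lookup would query for an IPv4 address."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return "%s.%s" % (reversed_octets, zone.strip("."))
```

For example, `dnsbl_query_name("192.0.2.99", "rbl.example.org")` yields `99.2.0.192.rbl.example.org`.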


We attempted to implement rblsmtpd in several 
ways. First, we configured rblsmtpd to run continuously 
and deliver a temporary (451) error to RBL- 
listed sites. This prompted many customer complaints 
regarding legitimate mail not getting through. Next, 
we set up rblsmtpd to deliver a permanent (553) error 
to those RBL-listed sites. Again, this caused many of 
our customers to complain about mail bouncing. 


Our last attempt involved running rblsmtpd only 
during times when we were being heavily spammed. 
Because it was configured to deliver a temporary error 
to RBL-listed sites, it was effectively useless. Spam 
would just queue up on the originating server until we 
turned off rblsmtpd. At that point, it was obvious that 
we needed something other than rblsmtpd. rblsmtpd 
did not help us solve the problem of our mail server 
getting pummeled by spammers’ SMTP connections, 
and it brought on many complaints. 


Some other alternatives to rblsmtpd include utilities 
such as Sieve [12] or SpamAssassin [13]. Unfortunately, 
these utilities must process each message 
individually. This results in a significant increase in the 
overall system resources required by the mail system. 
Compounded with the fact that spammers were 
already utilizing all of our server’s resources, these 
utilities were not an option. Another option was to buy 
faster hardware or multiple servers. We opted for a 
homegrown software solution before investing in 
more hardware. 


Design Goals 


Each site is going to have its own unique prob- 
lems when implementing an “out of the box” spam 
filter. In order to effectively implement a spam filtra- 
tion system, a site needs to address these questions: 

• What has not worked for us in the past? 
• Do we have enough resources to allow client-side 
  filtering options? 
• Do we have the time and expertise to create our 
  own spam blocking solutions? 
• Would it be more effective to purchase faster 
  and better hardware than to script a custom 
  solution? 
• How transparent does the spam blocking need 
  to be to the user base? 
• Are we concerned with bandwidth consumed 
  by spam attacks? 


After addressing the questions above, and ruling 
out failed approaches, we realized that we needed to 
refine our goals for a spam filtering system, and would 
probably need to engineer our own software-based 
solution based on those design goals. 


Our requirements were: 

• The method must conserve system resources. 
• The method must reduce the amount of bandwidth 
  consumed by spam attacks. 
• The method must not add much additional 
  overhead to mail processing. 
• The method must prevent spamming sites from 
  getting mail into the mail queue. 
• The method must be manageable in a way that 
  allows us to exempt certain hosts or networks. 
• The method must keep our customers happy by 
  minimizing the number of false positives. 
• The method must be as transparent as possible 
  to end users. 


Our number one concern was to implement a 
solution which solved our problem without increasing 
the load on our mail server, as we did not desire a 
hardware upgrade at the time we were developing 
Deny-Spammers. This limitation ruled out utilizing a 
processor intensive spam control system, such as Spa- 
mAssassin or Sieve. 


In addition to the headaches of a major mail 
migration, simply throwing hardware at the spam 
attacks would only be a short term solution. There has 
been, and will likely continue to be, a practically expo- 
nential growth in the amount of spam on the Internet. In 
a matter of months, our new hardware could be over- 
whelmed by additional and more creative spam attacks. 


A hardware solution also fails to rein in the prob- 
lem of bandwidth consumption by spam attacks. We 
simply wanted to reduce the ability of spammers to 
consume our resources (such as bandwidth and CPU 
utilization), conserving them for legitimate mail. 


We chose to use the frequency of attempts to deliver 
email messages to non-existent mailboxes as our 
heuristic. We later felt validated in choosing this metric 
because it was proposed by Jon Postel in RFC 706, 
which states, “A Host might make use of such a facility 
by measuring, per source, the number of undesired 
messages per unit time, if the measure exceeds a threshold, 
then the Host could issue the ‘refuse message from Host 
X message...’” Other metrics, such as the number of 
concurrent SMTP connections, could also be used to qualify the 
sending SMTP server as a spammer. 
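

The heuristic can be sketched as a per-source sliding-window counter. This is an illustration of the idea, not the Deny-Spammers code: the class name, the threshold, and the window length are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class BadRecipientTracker:
    """Count deliveries to non-existent mailboxes per source in a sliding window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.events = defaultdict(deque)

    def record(self, source_ip, timestamp):
        """Note one bad delivery attempt; return True if the source should be blocked."""
        times = self.events[source_ip]
        times.append(timestamp)
        # Discard attempts that have aged out of the window.
        while times and times[0] <= timestamp - self.window:
            times.popleft()
        return len(times) > self.threshold
```

With a threshold of 3 attempts per 60 seconds, a source is flagged on its fourth bad attempt inside one minute, while the same number of attempts spread over a longer period never trips the limit.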


Once a host is identified as a spammer, a method is 
needed to block future delivery attempts from that 
host to our mail server. We chose the most efficient 
method, which is to do the filtering at the IP level, in 
order to put the load of the filtration process on the 
operating system’s kernel, rather than at the applica- 
tion level. Filtering at the application level, such as is 
done by SpamAssassin and Sieve, for example, would 
not prevent a spam Denial of Service attack. 


Implementation/Solution 


Deny-Spammers is a daemon that interfaces to 
the mail transfer agent and the firewall ruleset control 
program. It uses a strategy based upon patterns that 
spammers produce, including attempts to send to non- 
existent addresses, to dynamically update the server’s 
ingress rules. This approach moves blocking spam 
away from the delivery agent to the mail server’s ker- 
nel, thus conserving system resources. 


We chose to implement our filtration tool in Perl 
to increase the speed of development. This allowed us 
to have a working prototype within a few days of our 
idea’s inception. Although the Perl implementation 
works well in production for our system, larger sites 
may want to consider using a more efficient develop- 
ment language. This becomes even more of a neces- 
sity as more heuristic tests are introduced. 


We needed a solution that would work with the software that
we were already using. Therefore, the current revision of
Deny-Spammers is specifically designed for use on FreeBSD
systems running qmail.


Presently, Deny-Spammers has two major prerequisites:
• any version of FreeBSD with IP firewall support installed in
  the kernel
• a patched version of qmail that includes the qmail-uce
  checklocal patch to reject mail for nonexistent addresses


Deny-Spammers interacts with the kernel’s firewall by using
FreeBSD’s ipfw [14] program to ban and un-ban hosts. ipfw
provides the user-level control of the firewall ruleset. Using
Deny-Spammers with another operating system’s firewall
application would require additions to the code. Support for
iptables, ipchains, packetfilter or IP Filter would all be
simple to add. Modifying the system calls that Deny-Spammers
makes would suffice to add support for any of these packet
filters.
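

As a rough illustration of what those calls might look like, the sketch below builds the ipfw command lines for banning and un-banning a single host. It is written in Python rather than the tool's Perl, and the function names, the one-rule-per-host numbering, and the /sbin/ipfw path are assumptions made for this example, not details taken from the Deny-Spammers source. The commands are only constructed here, not executed.

```python
def ban_command(rule_number, ip, ipfw="/sbin/ipfw"):
    """Return the argv list for a rule that denies SMTP traffic from one host."""
    # Shell form: ipfw add <rule> deny tcp from <ip> to any 25
    return [ipfw, "add", str(rule_number), "deny", "tcp",
            "from", ip, "to", "any", "25"]


def unban_command(rule_number, ipfw="/sbin/ipfw"):
    """Return the argv list that deletes the rule created by ban_command."""
    return [ipfw, "delete", str(rule_number)]


if __name__ == "__main__":
    # Hypothetical rule number and documentation-range address.
    print(" ".join(ban_command(1000, "192.0.2.15")))
    print(" ".join(unban_command(1000)))
```

Under these assumptions, supporting another packet filter would mean replacing the two builders with ones that emit that filter's own syntax.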


The types of patterns produced by spammers determine how
Deny-Spammers interfaces to the mail transfer agent. In our
case, spammers are detected by multiple sends to nonexistent
addresses. In order to detect delivery attempts to nonexistent
addresses in qmail, the qmail-uce checklocal patch was
required. This patch provides a modification for qmail’s SMTP
daemon, qmail-smtpd, and logs details about each attempted
delivery to a nonexistent mailbox, including the IP address of
the host which attempted this delivery.


Although Deny-Spammers is currently dependent on the
qmail-uce checklocal patch, it could be adapted to function
with other MTAs. The exact implementation would depend on the
MTA itself. Sendmail, for example, defaults to logging delivery
attempts to nonexistent addresses.


Deny-Spammers is intended to be executed at 
boot time and to run continuously. Many parameters 
can be fine-tuned from within the script. More infor- 
mation about these can be found in the code itself and 
in the implementation details that follow. To obtain the 
source code for Deny-Spammers, see the availability 
section near the end of this paper. 


Because the checklocal patch logs all attempted 
deliveries to nonexistent addresses, Deny-Spammers 
monitors the system’s mail log for messages emitted by 
the patch. These messages are parsed for the IP address 
of the host that attempted to make such a delivery. By 
defining thresholds of how many of these messages can 
be seen in a given time period, Deny-Spammers selec- 
tively prohibits hosts from making any more SMTP 
connections for a given “ban time” period. 


To produce this behavior, three hash structures 
are used to track the state of the spam filtering system. 
One is used to track the times that hosts sent undeliv- 
erable messages, another for banned spammers, and 
another for the exception list. 


The %spammer hash is a hash of lists. The keys of the hash are
host IP addresses. The value for each host is a list of
timestamps. These timestamps represent times at which the host
sent mail to a nonexistent address.


The %banned hash is a plain hash. The keys of the hash are
host IP addresses. The values of the hash are scalars
containing the timestamp at which that host was banned.


The %noban_list hash is a 4-level hash which contains the IP
address exception list. This hash is organized such that the
first level represents the first octet, the second level
represents the second octet, and so on. The keys at each level
are octets, with asterisks interpreted as wildcards. The value
at the last level is ‘1’ if a host or network is on the
exception list.
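

To make the three structures concrete, here is a minimal Python sketch of the same layout, with a wildcard lookup over the nested exception list. The function names add_exception and is_exempt, and the "192.0.2.*" pattern format, are assumptions for illustration; the actual Perl code may organize this differently.

```python
from collections import defaultdict

spammer = defaultdict(list)  # IP address -> list of timestamps
banned = {}                  # IP address -> time the ban started
noban_list = {}              # nested dicts, one level per octet


def add_exception(tree, pattern):
    """Insert a dotted pattern such as '192.0.2.*' into the nested dict."""
    octets = pattern.split(".")
    if len(octets) != 4:
        raise ValueError("expected four dot-separated fields: %r" % pattern)
    node = tree
    for octet in octets[:3]:
        node = node.setdefault(octet, {})
    node[octets[3]] = 1


def is_exempt(tree, ip):
    """Return True if ip matches an entry, treating '*' as a wildcard."""
    def walk(node, octets):
        if not octets:
            return node == 1
        if not isinstance(node, dict):
            return False
        # Try the literal octet first, then a wildcard at this level.
        for key in (octets[0], "*"):
            if key in node and walk(node[key], octets[1:]):
                return True
        return False
    return walk(tree, ip.split("."))


if __name__ == "__main__":
    add_exception(noban_list, "192.0.2.*")
    print(is_exempt(noban_list, "192.0.2.200"))
```

Nesting one level per octet keeps each lookup to at most a handful of dictionary probes, regardless of how many exceptions are loaded.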


The exception list is populated with the contents 
of the exception list file specified when the program 
starts. It is a flat ASCII text file containing one IP or 
network per line. An example exception list is pro- 
vided with the distribution. The list is periodically re- 
read by the script and any necessary firewall changes 
happen automatically. 


There are a number of variables in the beginning of the script
that may need to be adjusted based on the application. The most
important parameters are the number of non-deliverable message
attempts that triggers a ban and the time span over which those
attempts are counted. Ten attempts during a five-minute period
has worked for us. The ban time, or length of time a host stays
banned, is also adjustable. Lower ban times will typically keep
the size of the firewall ruleset reasonably small.
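

A minimal Python sketch of how these parameters could interact is given below. The defaults mirror the values mentioned in the text (ten attempts, five minutes, and the three-day ban time used in production); the function names and the max_entries cap are assumptions for this example rather than the script's actual identifiers.

```python
def record_attempt(spammer, ip, now, timespan=300, trigger=10, max_entries=100):
    """Track one failed delivery; return True if the host should be banned."""
    stamps = spammer.setdefault(ip, [])
    del stamps[:-max_entries]      # trim old entries so the list stays bounded
    stamps.append(now)
    # Count attempts that fall inside the sampling interval ending at `now`.
    recent = [t for t in stamps if now - t <= timespan]
    return len(recent) > trigger


def expired_bans(banned, now, ban_time=3 * 24 * 3600):
    """Return the hosts whose ban has lasted at least ban_time seconds."""
    return [ip for ip, since in banned.items() if now - since >= ban_time]


if __name__ == "__main__":
    history = {}
    # Eleven attempts ten seconds apart: only the last one trips the trigger.
    print([record_attempt(history, "192.0.2.9", t) for t in range(0, 110, 10)])
```

With these defaults, a host that fails once a minute never accumulates more than six attempts inside the window and is not banned.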


Other variables include the paths to the exception list, the
log files, and ipfw, as well as the regular expression used to
match incoming timestamps and IP addresses from the mail log
input. With the qmail-uce checklocal patch, nonexistent user
messages should appear in the system’s mail log as shown in
Listing 2.


Periodically the program will prune the banned list, unbanning
hosts which have been banned for the length of time specified
by the administrator or hosts which have been added to the
exception list since the last refresh. The refresh time is also
configurable.


An end rule is also defined. If this rule number is ever
reached, Deny-Spammers will clear the existing ruleset and
start over. This feature is useful if a server can only handle
a certain number of rules efficiently.


The pseudocode in Listing 1 shows the infinite loop which
occurs right after initialization. It is where virtually all of
the processing occurs in Deny-Spammers. Incoming IP addresses
from the mail log lines are matched against a regular
expression and are parsed. The IP address and timestamps are
stored for each nonexistent user message. Every time a line is
received and tracked, the program decides if the IP should be
banned based on the given parameters in the code.


    while (true) {
        Match incoming lines against a regular expression for
        undeliverable messages to nonexistent addresses and parse
        timestamp and IP address.

        Skip line if host is in the exception list.
        Trim the timestamp list for this host to $MAX_SPAMMER_ENTRIES.
        Add the timestamp to the host’s list contained in the spammer
        hash.

        Check how many delivery attempts to nonexistent addresses this
        host has made in the sampling interval, $SPAM_TIMESPAN.

        If nondeliverable messages > $SPAM_TRIGGER then filter this IP.

        If current time >= $next_refresh then
            calculate next refresh,
            reload the exception list and prune the banned hosts list
            (un-ban hosts who have been banned for $BAN_TIME)
    }

Listing 1: Pseudocode for infinite loop.


We have been using the same algorithm since Deny-Spammers was
put into production. Only minor changes to the parameters and
bug fixes have been required to produce the results we desired.
For example, we keep the firewall ruleset small by using a
relatively short ban time of three days, as opposed to, say,
two weeks.


We have only implemented one spam signature pattern so far.
Other metrics could be developed and implemented in a similar
manner as described above. In most settings, each test would
have to be carefully chosen, designed, and assessed for minimal
negative impact to the users. Different tests are likely to
intrinsically block more legitimate mail than other tests. If
multiple signatures were available, sites may also want to pick
and choose depending on the needs of their users. Fine-tuning
the tests may also be required, again depending on their needs,
to achieve the desired results.


    Jan 1 00:00:00 mailhost smtpd: 1234567890.123456 12345: DENYMAIL:
        RCPT_TO:_Filter.NoUser:_ relay unknown [123.123.123.123] FROM
        <bounce@your-info.net> ADDR <abcdefgh@telerama.com>

Listing 2: Typical nonexistent-user message.
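

For readers who want to experiment with the log format in Listing 2, the following Python sketch extracts a timestamp and the sending host's IP address from such a line. The regular expression is an assumption written for this example (the script's real pattern is configurable and is not reproduced in the text), and treating the first numeric field after "smtpd:" as a timestamp in seconds is likewise a guess based on the sample line.

```python
import re

# Hypothetical pattern: numeric timestamp, process id, the DENYMAIL and
# Filter.NoUser markers, then the bracketed IP address of the sender.
NOUSER_RE = re.compile(
    r"smtpd:\s+(\d+)\.\d+\s+\d+:\s+DENYMAIL:.*?Filter\.NoUser.*?"
    r"\[(\d{1,3}(?:\.\d{1,3}){3})\]"
)


def parse_nouser(line):
    """Return (timestamp, ip) for a nonexistent-user log line, else None."""
    match = NOUSER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    sample = ("Jan 1 00:00:00 mailhost smtpd: 1234567890.123456 12345: "
              "DENYMAIL: RCPT_TO:_Filter.NoUser:_ relay unknown "
              "[123.123.123.123] FROM <bounce@your-info.net> "
              "ADDR <abcdefgh@telerama.com>")
    print(parse_nouser(sample))
```

Lines that do not carry both markers return None, so ordinary mail log traffic is skipped.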


In Production 


Figure 1 shows the number of attempted deliveries to
non-existent mailboxes and the number of firewall rules over a
four-day period. The first two days show Deny-Spammers running.
For graphing purposes we intentionally reset it three times.
The last two days show what happens when Deny-Spammers is
disabled.


A five-minute sampling interval was used for both the number
of firewall rules and the number of undeliverable mail
attempts.


The graph starts with one firewall rule on April 25th, when
Deny-Spammers was initialized. From this point, the number of
firewall rules begins to increase. As the firewall rules
approach 1,000, the delivery attempts average around 100-200
every five minutes.


When the firewall is reset, the number of firewall rules goes
back to one, and the delivery attempts begin to increase. As
the delivery attempts decrease, the firewall ruleset grows at a
slower rate. Notice that the number of firewall rules increases
very quickly when the number of delivery attempts is high and
the firewall has just been reset.


Shortly after midnight on April 27th, the number of firewall
rules is at one for the third time. The delivery attempts
increased dramatically while Deny-Spammers was disabled for
about six hours.


Deny-Spammers was restarted where the firewall rules start to
increase again (on the morning of April 27th). A similar
pattern occurs: the delivery attempts decrease as the firewall
rules increase.


In the last two days of the graph, Deny-Spammers was
completely disabled. The delivery attempts average more than
twice the rate seen when the spam filtration system was
enabled.


Figure 1: Delivery attempts to non-existent mailboxes and number of firewall rules.


Limitations


Despite its usefulness, Deny-Spammers has some limitations.

• The exception list only supports individual hosts and/or
  classful networks. CIDR notation is not supported.
• Deny-Spammers is currently only compatible with FreeBSD
  machines running qmail.
• Another known issue with the software is that qmail-smtpd
  processes will hang for a while if a host is banned while
  they have an active connection. This is mostly a problem when
  the script is first started, as the initial surge of banning
  causes a large number of such hung processes.
• Scalability is limited in some cases. The kernel firewall
  code itself may not be able to efficiently process thousands
  of rules, which is a common scenario. We have determined that
  older revisions of FreeBSD, for example, are subject to this
  problem. With older versions, firewall rules were stored in a
  linked list. Newer revisions use a tree structure, making the
  handling of large rulesets much more efficient. Using an
  inefficient data structure ultimately caps the maximum number
  of rules that a server can handle efficiently.
• Spammers could exploit the qmail-uce checklocal patch to find
  valid addresses. qmail, by default, doesn’t allow a sender to
  know whether or not an address is valid. If this were a major
  concern, the checklocal patch could be modified so that it
  only logs the attempts, instead of logging and bouncing the
  message. In this case, spammers wouldn’t be able to verify
  addresses, and Deny-Spammers would still operate correctly.


These limitations have not prevented Deny-Spammers from being
a very useful tool. Customers are not receiving as much spam,
and the tool deals effectively with Denial of Service instances
caused by spam. Customer complaints have been minimal compared
to our other approaches. When a false positive is detected by a
user and brought to our attention, simply adding the correct
host or network to the exception list should be all that is
required to resolve the issue.


Future Plans 


Deny-Spammers was developed as a quick and 
dirty hack to solve a pressing problem that we had. As 
such, it works very well for us, on our platform, and 
for our staff, who understand its limitations. 


Now that it has been in production for over a year and is
providing the results we desire, we can see many areas in which
to expand its functionality to make it attractive and usable
for the general community. Some obvious planned improvements
are addressing Deny-Spammers’ current lack of scalability and
interoperability, and adding a GUI to allow non-administrators
to access its log files and exception list.

• Scalability issues. Implement Deny-Spammers for a mail
  server farm or a single server with many more users.
• Add the ability to use a separate firewall. Currently, the
  firewall must be located on the same machine that is
  processing mail. A separate firewall could be updated
  remotely via an ssh tunnel. This feature could also be useful
  when scaling this application to a mail server farm, so that
  one firewall could be responsible for a group of mail
  servers. Given a secure communication mechanism to update the
  firewall rules, such an improvement should be
  straightforward.
• Integration with third-party applications such as
  SpamAssassin or Anomy Sanitizer. Allow results from
  SpamAssassin/Anomy to determine whether or not a host gets
  banned.
• Improve statistical generation for research purposes. Create
  historical averages of the number of hosts blocked over long
  periods of time. Look for interesting patterns, such as
  whether spam comes in bursts and when it most frequently
  occurs.
• Develop a better interface for unbanning hosts and managing
  the exception list. Add CIDR notation support for the
  exception list.
• Interoperability with other operating systems. This is
  simply a matter of adapting the firewall system calls to work
  with various firewall implementations (ipchains, iptables, IP
  Filter, packetfilter).
• Interoperability with other mail transfer agents (sendmail,
  postfix, et cetera). Because each program works slightly
  differently, interfacing each spam signature pattern could be
  a tedious process.
• Develop more “spam signatures.” A few other patterns we’ve
  considered using as criteria are:
  1. The number of concurrent SMTP connections made by a host:
     experience has shown that spammers are capable of making
     many parallel SMTP connections to the same mail server.
  2. The number of recipients a message is sent to: spammers
     often send messages with extremely large RCPT TO lists.
• A point system for hosts could be introduced, such that
  multiple spam signature patterns are taken into account for
  each host (similar to the ‘hits’ mechanism used by
  SpamAssassin).
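

One way the proposed point system might look is sketched below in Python. Everything in it is hypothetical: the signature names, the point values, and the ban threshold are invented for illustration, since this feature is only described as a future plan.

```python
# Hypothetical weights: points charged per observed event of each signature.
SIGNATURE_POINTS = {
    "nonexistent_recipient": 1,
    "concurrent_connection": 2,
    "oversized_rcpt_list": 3,
}


def host_score(events, points=SIGNATURE_POINTS):
    """Sum the points for a host given {signature name: event count}."""
    return sum(points[name] * count for name, count in events.items())


def should_ban(events, threshold=10, points=SIGNATURE_POINTS):
    """Ban once the combined score reaches the (assumed) threshold."""
    return host_score(events, points) >= threshold


if __name__ == "__main__":
    observed = {"nonexistent_recipient": 4, "concurrent_connection": 3}
    print(host_score(observed), should_ban(observed))
```

Combining signatures this way lets a host that is only mildly suspicious on each individual test still cross the threshold.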


Availability 


Deny-Spammers is freely available. Source code and
documentation can be found at
http://deny-spammers.telerama.com . Deny-Spammers is written in
Perl 5 and was developed and tested under FreeBSD. It contains
no dependencies on non-standard modules or libraries. Specific
questions regarding Deny-Spammers can be sent to
denyspam@telerama.com.


Conclusions 


This paper describes a stateful inspection strategy for
dynamically creating firewall rules that block access from mail
hosts based upon their recent behavior. If the sending mail
host is determined to be a spammer (based on our criteria), a
daemon updates the firewall ruleset for our mail server.


Proving the success of our strategy is difficult due 
to the impossibility of measuring the lack of an event 
(lack of delivery of spam messages). Many alternative 
approaches were too resource-intensive for us to imple- 
ment. We found that other spam filters that were accept- 
able for us resource-wise (such as MAPS-RBL) created 
large and obvious negative customer feedback. We have 
not had that backlash upon implementing Deny-Spam- 
mers. We feel that we have fewer customer complaints 
about receiving spam, but we don’t have enough data to 
support that point empirically. 


What our data does show is that we are banning 
thousands of misbehaving mail hosts based on our 
metrics. We believe all of these hosts to be likely 
spammers. 


Authors 


All of the authors have worked at Telerama Inter- 
net (http://www.telerama.com) for the past several 
years. 


Deeann M. M. Mikula is the Director of Operations and Junior
Unix System Administrator at Telerama Internet, and is
co-founder of the local SAGE Chapter in Pittsburgh. She has
worked as a Behavioral Neuroscience Researcher, a Coffeehouse
Manager and a Visual Artist. When not in front of a keyboard,
she can be found drinking scotch at a Gothic club or painting
and drawing. Deeann can be reached via email at
deeann@telerama.com.


Chris Tracy is Telerama Internet’s Senior Network and Systems
Engineer and a SCinet Volunteer. He holds a Bachelor’s Degree
in Computer Engineering from the University of Pittsburgh. When
not in front of a keyboard, Chris can be found drinking beer,
DJing or playing drums. Chris can be reached via email at
chris@telerama.com.


Mike Holling is a part-time Network Engineer with Telerama
Internet. He holds Bachelor’s Degrees in Computer Science and
Electrical Engineering from Carnegie Mellon University. When
not snowboarding, skateboarding or drinking beer, Mike can be
found working as a Network Consultant in Whitefish, Montana.
Mike can be reached at myke@telerama.com.


Acknowledgments 


We are grateful to many people for their contri- 
butions to this project and this paper. We would like to 
thank Doug Luce, owner and CEO of Telerama, for 
fostering a workplace where we are encouraged to try 
novel approaches to problems. We thank our fellow 
staff at Telerama for putting up with the real-time 
tweaking of our production mail server. 


Especially valuable in producing this paper was peer review
and support. We would like to thank Esther Filderman and Josh
Simon for encouraging us to publish our work, and for support
along the way. The advice of our shepherd, John Sellens, and
the comments of our anonymous reviewers were invaluable in
shaping our final paper.


2002 LISA XVI — November 3-8, 2002 — Philadelphia, PA


Spam Blocking with a Dynamically Updated Firewall Ruleset


References 


[1] Postel, Jon, “RFC 706: On the Junk Mail Problem,”
November 1975. Network Working Group. 10 April 2002,
http://www.faqs.org/rfcs/rfc706.html .

[2] Lee, Jennifer B., “Spam: An Escalating Attack of the
Clones,” New York Times, June 27, 2002.

[3] Gomes, Lee, “How Hotmail Keeps Its Email Empire From
Spam’s Clutches,” Wall Street Journal, July 8, 2002.

[4] Gartner Consulting, “ISPs and Spam: The Impact of Spam on
Customer Retention and Acquisition,” June 14, 1999.

[5] “Mail Abuse Prevention System LLC,”
http://mail-abuse.org/rbl .

[6] Bernstein, Dan, “qmail home page,” http://www.qmail.org .

[7] Varshavchik, Sam, “MAIL 1.01 unified
Anti-UCE/Mailbombing patch,”
http://portofhoodsport.org/qmail/misc/uce.html .

[8] Postel, Jonathan B., “RFC 821: Simple Mail Transfer
Protocol,” August 1982. Information Sciences Institute:
University of Southern California, 20 April 2002,
http://www.ietf.org/rfc/rfc0821.txt .

[9] Bernstein, Dan, “The rblsmtpd program,”
http://cr.yp.to/ucspi-tcp/rblsmtpd.html .

[10] Bernstein, Dan, “ucspi-tcp home page,”
http://cr.yp.to/ucspi-tcp.html .

[11] Bernstein, Dan, “The tcpserver program,”
http://cr.yp.to/ucspi-tcp/tcpserver.html .

[12] Showalter, T., “Sieve: A Mail Filtering Language,”
January 2001. Network Working Group, 14 April 2002,
http://www.ietf.org/rfc/rfc3028.txt .

[13] Hughes, Craig R., “SpamAssassin home page,”
http://www.spamassassin.org .

[14] Ugen J. S. Antsilevich, Poul-Henning Kamp, Alex Nash,
Archie Cobbs, and Luigi Rizzo, “Manual page for ipfw — IP
firewall and traffic shaper control program,”
http://www.freebsd.org/cgi/man.cgi?query=ipfw .







Holistic Quota Management: The 
Natural Path to a Better, More 
Efficient Quota System 


Michael Gilfix — Tufts University 


ABSTRACT 


Disk quota systems exist to protect a limited resource and ensure that users can share it. 
However, existing quota management systems concentrate on controlling user privileges, rather 
than protecting resources. This paper suggests a new management model based upon a holistic 
view of resources and their controls. By acting upon a resource globally rather than upon 
individual users, the new approach exposes trends, allows for better resource planning, and allows 
for easy understanding of the impact of changes on a user’s ability to accomplish work. A new 
tool ‘Qualm’ is the first component of a new system for dynamic resource management that allows 
for decisions based not upon fixed limits for resource usage, but upon limits that change with
usage patterns and demand.


Introduction 


Efficient quota management is difficult. As one 
frustrated admin put it: “I’ve been making noise for 
the last couple of years that I think we should increase 
(at least double, probably more) the default quotas, but 
the analysis required to figure out where to increase 
them has been scaring me.” This frustration is a prod- 
uct of the limitations of current approaches to quota 
management; while viewing and modifying quotas for 
individuals or a segregated sub-section of the user 
population is relatively simple, it remains difficult to 
ascertain current file system usage, determine where 
change is needed, and assess the impact of a change. 


This paper presents a new model of quota man- 
agement designed to address these shortcomings. The 
model approaches the problem holistically; rather than 
focusing on privileges for individual users, it focuses 
on relative resource share and the global effect of 
change. The result is a paradigm where only changes 
that assure the integrity of the underlying resource are 
valid. 


Using this paradigm, a new tool, ‘Qualm,’ was 
created. Qualm works on top of existing quota systems 
to provide a simple means of performing global analy- 
sis, as well as a framework for making quota changes 
in a global context. Qualm employs a holistic
approach, using multiple graphical formats to display 
the state of the entire quota system. These displays 
allow for quick assessment of the state and efficiency 
of the quota system at a glance, and provide a means 
for global manipulation of the underlying system. 


The Existing World of Quotas 


The most popular freely available quota management tools
today, such as the UNIX quota [8] utility and the NT quota
system [12], solve the problem of quota management with usage
limits for individual users. Users are given two kinds of
limits on the amount of disk space they may use: hard and soft
limits. Hard limits are absolute and can never be exceeded.
Soft limits may be exceeded by the user, but the user must
subsequently lower his disk usage below the limit within a
given time frame or suffer a consequence. These tools also
provide limited facilities for the creation of user groupings,
where a user inherits the quota limit of his group. However,
these group mechanisms offer little benefit beyond the ability
to administer the quota level of multiple users in one place.


These tools, however, suffer from other severe 
limitations. Changes to the quota system are singular, 
meaning they are made without regard to the global 
state of the quota system and the underlying storage. 
Moreover, these tools offer no easy way of assessing 
the current state or efficiency of the quota system, thus 
rendering global decisions difficult. 


Commercial products targeted at Fortune 1000 companies, such
as Precise Software Solutions’ StorageCentral SRM technology
[15] (which will appear in a future version of Windows, thanks
to a strategic alliance with Microsoft), offer a step up over
their freely available counterparts by bringing the kind of
flexibility that one would expect from an enterprise solution.
Using Precise’s software, the system administrator can set up
five different kinds of quota limits, better divide users up
into service groups, and generate several kinds of reports,
from current space usage breakdown by file type, to user or
group usage reports. The software even provides a facility for
some basic trend analysis: a system administrator can view the
usage history for an individual user or a disk.


Nonetheless, this approach still focuses on setting individual
user limits. In addition, there is no easy way to visualize the
distribution of the quota system in its entirety, or act upon
the quota system in that context; all changes are still made at
the user level, to user quota limits, regardless of the true
state of the underlying disk. Finally, gaining the benefits of
Precise’s StorageCentral SRM requires a complete and costly
switch over to their quota management suite. While another of
Precise’s products, QuotaAdvisor [14], is much more
lightweight, many of the benefits of the complete quota
management suite are lost. Much was done during the creation of
Qualm to avoid these adoption issues and make the integration
of Qualm into the existing quota system relatively simple.


In Search of Inspiration 


To overcome the limitations of the existing quota management
solutions, a new model of quota management was needed. Recent
work by Mark Burgess [3] provided an appealing start. Burgess
suggests that a system quota is an inefficient strategy for
managing a dynamic resource such as disk space. He then
emphasizes the importance of global knowledge when selecting an
optimal strategy and concludes, “A quota strategy can never
approach the same level of productivity as one which is based
on competitive counterforce.”


While Burgess’ work uses game theory [11] to 
explore this competition as one between the system 
administrator and individual users, it ignores an 
important part of this dynamic interaction: the compe- 
tition between individual users of the system to meet 
their own needs above everyone else’s. Accordingly, 
the spirit of Burgess’ work was incorporated into the 
core philosophy of the new quota management model 
by emphasizing users’ relative share of a resource, 
rather than individual limitations on resource quantities.
This fosters an element of competition: since users are
allotted a percentage of the resource, an increase in their
allotment can only come at the expense of another user.


Burgess’ work, unlike some other work in convergent system
administration [1, 2, 6, 17], treats policy as a mutable thing
that must be changed and tuned for optimal performance. This
led to the idea of attempting to model the quota management
problem as a problem in control theory, a branch of mathematics
that is often used to model electro-mechanical systems in
Electrical Engineering [4]. A possible feedback model for a
quota system inspired by control theory is shown in Figure 1.
In this model, the system administrator defines an operating
curve (OC) that describes how the system will respond to
disturbances from a user, d(t). The control function, as
defined by the OC, then affects the output of the system,
which, when combined with the continual input of user activity,
causes the system to converge to the new desired state.


While the control-theoretic model offered an interesting new
way to graphically control how the quota system enforced quotas
and responded to aberrant disk usage, it lacked the “global
knowledge for decision making” that Burgess’ work emphasized.
In the end, both these ideas were merged in the formulation of
the newly adopted model of quota management.


Finally, there was the challenge of creating an 
interface that best represented the global state of a 
quota system. My work in creating Peep: The Network 
Auralizer [9] demonstrated that when digesting a con- 
siderable amount of information, the value of the 
whole can be greater than the sum of its parts; it is 
more important to convey the general state and trends 
of the quota system, rather than the individual values 
that comprise the system. 


In addition, Alva Couch’s work on visualization of large
execution environments, done while developing seeplex [5] and
xscal, provided inspiration for developing Qualm’s scalable
graphical environment. The algorithms used in xscal for
visualizing very large numbers of data points proved very
relevant in visualizing quotas; a graphical display of an
entire quota system needs to be able to plot thousands of data
points on a single display.


Figure 1: A control theory feedback model for a quota system.


A New Model for Quota Management


The new model for the quota system combines elements from the
competitive model and the control-theoretic model in a novel
way. Rather than the traditional approach of viewing quota as
assigning a specified amount of disk space to an individual or
group of users, all available disk space is treated as a
continuous, finite resource, where each user is allotted a
fraction of that resource (a simplification which is valid
because the smallest unit of measurement, the byte, is very
small compared to the total resource size). The system
administrator then defines a distribution curve that describes
how the administrator thinks the disk resource should be
shared, for example, a certain user population should get more
than another user population. This distribution has the
constraint that the total area under the distribution curve
must be equivalent to the size of the storage resource.
Finally, a separate mechanism configured by the system
administrator determines user privilege, i.e., where a user
gets to sit on the distribution curve.


Figure 2 shows how the model works graphi- 
cally; this distribution curve allots more space to a 
sub-section of the population and tapers off for the 
general population. The X-axis of the graph is then 
divided up into equal intervals amongst the n-user 
population, where each user is given a portion of the 
area under the distribution curve. The user’s area then 
translates into a portion of the storage resource. In this 
model, the sum of disk space allocated to each user is 
equivalent to the total storage size, or more formally: 

\[
\int_{0}^{n} d(t)\,dt \;=\; \sum_{i=1}^{n} \Delta u_i \;=\; \text{Total Storage}
\]

where n is the total number of users in the quota system,
d(t) is the disk space distribution curve, and Δu_i is the
amount of disk space allocated to user i. Figure 3 shows a
distribution curve that could be used to implement service
levels in existing quota systems under the new model. Here,
each step on the distribution curve represents a different
quota limit, and its relative length indicates the percentage
of the user population that has that limit.


Just as in traditional quota systems, a user can 
use any amount of disk space up to the maximum 
allotted by their position on the distribution curve.
Because of the constraint on the area of the distribution
curve, increasing the fraction of disk space for a
single user reduces the amount of disk space allotted to
every other user. Only changes to a certain user’s 
allotment that do not infringe upon the actual usage of 
other users (barring explicit action from the system 
administrator) are considered valid. 


The new model offers an important advantage: a 
user’s disk space is decoupled from the actual size of 
the resource. Thus, if the size of the resource changes, 
the user’s allotted space changes while his relative
share remains the same. Taxation of the resource is
also reflected by this approach: increasing the number
of users (in this case n) automatically decreases the
amount allocated to each user. Still, the user’s relative 
privilege remains the same throughout both changes. 
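
The decoupling described above can be illustrated with a minimal sketch. It is not part of Qualm; the function name `allot_shares`, the midpoint-rule integration, and the example curve are assumptions made for illustration. Each user i receives the area under the curve over the interval [i, i+1), and the areas are rescaled so that they sum to the total storage.

```python
from typing import Callable, List


def allot_shares(curve: Callable[[float], float], n_users: int,
                 total_storage: float, steps: int = 200) -> List[float]:
    """Split total_storage among n_users in proportion to the area under
    `curve` over each user's unit interval [i, i+1)."""
    width = 1.0 / steps
    areas = []
    for i in range(n_users):
        # Midpoint-rule approximation of the integral of curve over [i, i+1).
        area = sum(curve(i + (k + 0.5) * width) for k in range(steps)) * width
        areas.append(area)
    total_area = sum(areas)
    if total_area <= 0:
        raise ValueError("distribution curve must enclose a positive area")
    # Enforce the constraint that the total area equals the storage size.
    scale = total_storage / total_area
    return [a * scale for a in areas]


def example_curve(t: float) -> float:
    # Hypothetical two-level curve: the first two users are privileged.
    return 4.0 if t < 2 else 1.0


small = allot_shares(example_curve, 4, 1000.0)
large = allot_shares(example_curve, 4, 2000.0)   # storage doubled
```

Doubling the storage doubles every entry of `small`, while each user's fraction of the whole stays the same, which is the property the model relies on.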


This approach also helps separate policy from net- 
work implementation; the distribution curve describes 
what the system administrator believes proper space
allocation should look like, given the needs of his users 
and the amount of resources available, while the privi- 
lege mechanism determines where specific users fit into 
the system administrator’s master plan. 


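Figure 2: The new model illustrated. (Vertical axis: disk space, d(t); horizontal axis: the n users; Δu marks a single user's share.)

Figure 3: An example of a current quota implementation under the new model. (Vertical axis: disk space, d(t); horizontal axis: the n users.)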


The model agrees closely with what system 
administrators currently try to implement as network 
policy, and incorporates Burgess’ idea of competition 
as a way of maximizing resource efficiency. Here, the 
distribution curve can be thought of as the system 
administrator’s optimal strategy: a user is given a 
share of the resource by the system administrator that 
reflects his given need (as determined by the level of 
privilege granted by the system administrator) while 
remaining in accordance with the administrator’s gen- 
eral strategy. Giving a single user a larger share of the 
resource comes at the expense of all other users in the 
system, which is true of any finite resource, and cap- 
tures the essence of competition. Consequently, 
changes are made to a global model, where each 
change affects the system as a whole, rather than a 
localized and segregated part. 


This has interesting consequences: in the model, 
the system administrator is no longer trying to limit 
users but is instead trying to protect and maximize a 
finite resource. Only changes that maintain the 
integrity of the storage and respect other users’ use of
it are valid. 


Towards an Implementation 


Some concessions and design constraints were 
needed to create a successful, adoptable implemen- 
tation. Above all, the tool needed to be capable of 
inter-operating with current quota implementations. 
To meet this requirement, a decision was made to 
push aside the implementation of the privilege 
mechanism and focus on the analysis and global 
manipulation of the quota system. The tool could 
then use the existing quota system for its back-end 
operations, while providing the user with an interface 
that best agreed with the new model. 


‘Qualm’ was created as a compromise between 
the new quota management model and the inter-oper- 
ability requirement. The goal in creating Qualm was to 
create an interface that allowed the system administra- 
tor to quickly and easily visualize the state of his 
quota system, assess problem areas, and make global 
changes. A flexible interface was also important; if the 
new interface were to replace the existing quota man- 
agement interface, the system administrator would 
need to be able to tailor the displays to reflect his par- 
ticular needs. 


In order to form a scalable view of the state of the 
quota system that faithfully conveys the global distribu- 
tion of the data and its interrelationships, several features 
of xscal’s display model were employed. Building upon 
the graphing techniques used by xscal enabled Qualm to 
avoid many of the performance problems inherent in 
generating plots for large data sets. Sorting of the data 
was used to expose trends and groupings inherent in the 
data set. Qualm moves beyond xscal’s abilities in this 
regard by allowing for multiple data fields to be dis- 
played on a single plot, and allowing for hierarchical 
sorting. Using hierarchical sorting, secondary fields can 
be sorted with regard to the result of sorting their parent 
fields, exposing categories within the data, and trends 
within those categories. 
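
One simple way to realize such a hierarchical sort is a lexicographic sort on a tuple of fields. The sketch below is an illustration only and is not taken from Qualm's source; the function name is hypothetical, and the field names are borrowed from the configuration example in Figure 9.

```python
from typing import Dict, List, Sequence


def hierarchical_sort(records: List[Dict[str, float]],
                      fields: Sequence[str]) -> List[Dict[str, float]]:
    """Order records by the first field, break ties with the second,
    and so on down the list of fields."""
    return sorted(records, key=lambda r: tuple(r[f] for f in fields))


users = [
    {"Hard Limit (Blocks)": 2000, "Soft Limit (Blocks)": 1500, "Blocks": 900},
    {"Hard Limit (Blocks)": 1000, "Soft Limit (Blocks)": 800, "Blocks": 750},
    {"Hard Limit (Blocks)": 2000, "Soft Limit (Blocks)": 1500, "Blocks": 100},
    {"Hard Limit (Blocks)": 1000, "Soft Limit (Blocks)": 800, "Blocks": 20},
]
order = ["Hard Limit (Blocks)", "Soft Limit (Blocks)", "Blocks"]
ranked = hierarchical_sort(users, order)
```

Users with the same limits end up adjacent, and within each such category they are ordered by usage, which is how categories and the trends inside them become visible on a single plot.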


Qualm’s displays use a sum of step functions to 
depict transitions between values. Because of the large 
number of data points used in creating the display, a 





Figure 4a: A Plot at full view. 
“continuum effect” occurs and these step functions
appear to form a smooth distribution curve. Upon 
closer inspection (using zooming), the steps become 
more apparent. The step function provides a nice 
emphasis on the values and transitions between data 
points without introducing any graphical artifacts. 
However, the step function is not always ideal when 
plotting multiple fields on a single graph. In such 
cases, Qualm provides alternate display mechanisms, 
such as error bars that extend away from the primary 
plot, or scattered data plots. 
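
To make the step-function idea concrete, the following sketch collapses a sorted list of values into horizontal segments, one per run of equal values. This is an assumption about how such a display could be prepared, not a description of Qualm's or xscal's actual drawing code; the name `step_segments` is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def step_segments(sorted_values: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Collapse sorted values into (start, end, value) steps, where
    start is inclusive and end is exclusive on the user axis."""
    segments = []
    start = 0
    for value, run in groupby(sorted_values):
        length = sum(1 for _ in run)
        segments.append((start, start + length, value))
        start += length
    return segments


steps = step_segments([1, 1, 2, 5, 5, 5])
```

Only the transitions need to be drawn, so the number of line segments depends on the number of distinct values rather than on the number of users.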


Furthermore, Qualm’s capabilities are modular 
and extensible. Access to the underlying resource, 
such as quota data or a flat file, is implemented as a 
module, so Qualm can easily be extended to work on 
top of other resources. In addition, Qualm’s graphing 
library provides the system administrator with the 
tools needed to create new graph types or extend exist- 
ing graph types, so displays can better be tailored 
towards the administrator’s need. 


Performing Analysis with Qualm 


Running Qualm on the Tufts University EECS 
network produced some remarkable results. Figure 5 
shows a display of block usage for all EECS users. Val- 
ues of numbers of blocks used appear on the vertical 
axis and range over all users of the EECS network on 
the horizontal axis. The axes are sorted in increasing 
usage values from left to right. The one-dimensional
frequency plots at the bottom and left side of the graph
indicate the frequency of transitions between values in 
the main plot, where each hash mark represents a tran- 
sition. The display yields an interesting result: block 
usage appears to follow a Pareto [13] distribution. A 
plot of file usage followed the same distribution. Even 
more interesting was that this distribution remained 
constant over a period of six months! This strongly 
agrees with Michael Mitzenmacher’s dynamic model of 
file sizes [10] and reaffirms that file sizes in large net- 
works are indeed statistically predictable. 
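
The Pareto observation above was made by inspecting the plot. A reader who wants a numerical check could estimate the Pareto shape parameter from usage data; the sketch below uses the standard maximum-likelihood estimator, alpha = n / sum(ln(x_i / x_min)), which is a swapped-in technique and not something the paper reports using. The function name and the synthetic data are assumptions.

```python
import math
import random
from typing import Iterable


def pareto_shape_mle(values: Iterable[float], x_min: float) -> float:
    """Maximum-likelihood estimate of the Pareto shape parameter for
    observations at or above the scale parameter x_min."""
    logs = [math.log(v / x_min) for v in values if v >= x_min]
    total = sum(logs)
    if not logs or total <= 0:
        raise ValueError("need observations strictly above x_min")
    return len(logs) / total


# Synthetic "block usage" drawn from a Pareto law with shape 1.5.
rng = random.Random(42)
sample = [rng.paretovariate(1.5) for _ in range(20000)]
estimate = pareto_shape_mle(sample, x_min=1.0)
```

An estimate that stays stable across monthly snapshots would support the claim that the distribution remained constant over time.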


Next, one can plot quota hard limits against 
quota soft limits and block usage using a hierarchical 
sort (as shown in Figure 6). Once again, values appear 
on the vertical axis and range over all users on the 


Figure 4b: A zoomed sub-section 
horizontal axis. This display allows quick assessment
of how quota is distributed across users of the system 
and how active those users are. The soft quota looked 
as expected, showing the same distribution as the hard 
quota settings, except significantly lower. The shape 
of the quota limits indicates that EECS users tend to
fall within four different categories, or four different 
levels of service. Interestingly, even though Qualm 
had no knowledge of user groupings, these groupings 
were implied by the sort order of the graph! Moreover, 
the block usage yields some very interesting facts: 
most of our users were given quotas well beyond their 
needs. At the same time, certain users are over their 
soft quota limits by a substantial amount and may 
require additional space. 


In order to better assess the effectiveness of the 
EECS soft quota limits, on-going usage data was accu- 
mulated over a period of several months and fed into 
Qualm. Figure 7 shows a plot of soft limits and block 
usage with error bars extending from the block usage 
trace (color coded in reality) indicating the maximum 
deviation of that usage value over the course of the 
period. The plot indicates that users who were given 
larger soft limits tended to deserve it; the space 
requirements of those users fluctuated more than any 
other category. Other users have exceeded their soft
limit at some point and might be good candidates for 
an upgrade to a higher service level. Finally, the large 
number of over-subscribed users suggests that perhaps 
with a little tweaking, a much more efficient configu- 
ration could be achieved, minimizing the cost of future 
storage upgrades. 
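
The kind of triage read off the plots can also be expressed as a short script. This is a sketch under stated assumptions: the `Usage` record, the `classify` function, and the 25% threshold for calling a user over-subscribed are all hypothetical and are not values used in the EECS analysis.

```python
from typing import Dict, Iterable, List, NamedTuple


class Usage(NamedTuple):
    login: str
    blocks: int       # blocks currently in use
    soft_limit: int   # soft quota limit, in blocks


def classify(users: Iterable[Usage],
             slack_ratio: float = 0.25) -> Dict[str, List[str]]:
    """Group logins into users over their soft limit, users whose usage
    is far below their soft limit (over-subscribed), and the rest."""
    report: Dict[str, List[str]] = {
        "over_soft_limit": [], "over_subscribed": [], "ok": []}
    for u in users:
        if u.blocks > u.soft_limit:
            report["over_soft_limit"].append(u.login)
        elif u.blocks < slack_ratio * u.soft_limit:
            report["over_subscribed"].append(u.login)
        else:
            report["ok"].append(u.login)
    return report


report = classify([
    Usage("alice", 1200, 1000),
    Usage("bob", 50, 1000),
    Usage("carol", 600, 1000),
])
```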


Resource Manipulation with Qualm 


A holistic approach to quota manipulation was 
adopted when creating Qualm: when making adjust- 
ments, rather than concentrating on the details, the 
system administrator tries to make the global picture 
“look right.” In the context of Qualm, this translates 
into making adjustments to the distribution curves on 
the soft limit and hard limit displays. 


An adjustment can be anything from lowering or 
raising a small sub-section of the curve, to radically 
changing the entire shape of the curve. In order to adjust 
the limit of the entire population that presently has a 
particular limit, adjustments are made by left-clicking 
on the distribution curve at the level of the existing 
limit, and dragging the plot up or down. The left-click 
adjustment affects all users with that same initial limit. 
Alternatively, the administrator can right-click and 


Figure 5: A plot of block usage for all EECS users.

Figure 6: A sorted plot of hard limits (uppermost trace), soft limits (mid-level trace), and block usage (lowermost trace) for all EECS users.

Figure 7: The soft limit (uppermost trace) and block usage (lowermost trace) from Figure 6, with block usage deviations over time extending from the block usage trace.

Figure 8b: Manipulation of EECS hard limits to give a higher ceiling to more active users in each usage category, II.
select a sub-segment of the population with a given
limit and adjust only the limits of that sub-segment. 
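
The two kinds of adjustment described above can be modeled as a single operation with an optional selection. The sketch below is illustrative only; `adjust_level` and its parameters are hypothetical names, and Qualm's internal representation may differ.

```python
from typing import Dict, Optional, Set


def adjust_level(limits: Dict[str, int], old_limit: int, new_limit: int,
                 only: Optional[Set[str]] = None) -> Dict[str, int]:
    """Return a copy of `limits` in which users at old_limit move to
    new_limit. With only=None every such user moves (the left-click
    case); otherwise only the selected logins move (the right-click
    case)."""
    adjusted = {}
    for login, limit in limits.items():
        selected = only is None or login in only
        adjusted[login] = new_limit if (limit == old_limit and selected) else limit
    return adjusted


limits = {"alice": 100, "bob": 100, "carol": 500}
whole_level = adjust_level(limits, 100, 150)
sub_segment = adjust_level(limits, 100, 150, only={"bob"})
```

Returning a new mapping rather than modifying the input keeps the original limits available for comparison until the change is committed.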


Operations in Qualm can only be performed on 
segments of the distribution curve, never on individual 
points. However, depending on the zoom-factor, those 
segments may represent any number of users from a 
large user population to a single user. A quick rule of 
thumb is that the smoother the curve, the more users 
affected by the operation. Here, zooming serves a dou- 
ble purpose: it allows the system administrator to get a 




closer look at the underlying distribution and to con- 
trol the granularity of his changes. Additionally, the 
system administrator can still use the traditional quota 
mechanism if he deems zooming insufficient. 


Finally, changes to the distribution curve are 
only transferred to the underlying resource when the
system administrator explicitly chooses to commit. 
This allows the administrator to make as many 
changes as he likes and take some time to examine 
them fully before letting the changes go live. It also 


<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<configuration app='qualm'>
  <!-- Configuration file for Qualm -->
  <options>
    <!-- Program options will go here -->
  </options>
  <resources>
    <resource type='quota' name='localhost' host='localhost'>
      <source type='module' module='File'>
        <args>
          <path>history.dat</path>
        </args>
        <headers>
          <header>Login</header>
          <header>Time</header>
          <header>Files</header>
          <header>Hard Limit</header>
          <header>Soft Limit</header>
          <header>Blocks</header>
          <header>Hard Limit (Blocks)</header>
          <header>Soft Limit (Blocks)</header>
        </headers>
        <plot type='FrequencyPlot' name='Block Usage' ylabel='Block Usage'
              xlabel='Users'>
          <field>Blocks</field>
        </plot>
        <plot type='DotFrequencyPlot' name='Hard Limit/Soft Limit/Blocks'
              ylabel='Hard Limit (Blocks)' xlabel='Users' sorted='1'>
          <field>Hard Limit (Blocks)</field>
          <field colour='BLUE'>Soft Limit (Blocks)</field>
          <field colour='DARK GREEN'>Blocks</field>
        </plot>
        <plot type='HistFrequencyPlot' name='History Soft limit (Blocks)/Blocks'
              ylabel='Soft Limit (Blocks)' xlabel='Users' sorted='1'>
          <field>Soft Limit (Blocks)</field>
          <field colour='DARK GREEN'>Blocks</field>
        </plot>
        <plot type='DotFrequencyPlot' name='Soft Limit/Blocks'
              ylabel='Soft Limit (Blocks)' xlabel='Users' sorted='1'>
          <field>Soft Limit (Blocks)</field>
          <field>Blocks</field>
        </plot>
      </source>
    </resource>
  </resources>
</configuration>


Figure 9: An example Qualm configuration file. 
avoids a rather expensive operation until absolutely
necessary; a single change could easily affect thou- 
sands of records. 
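
The commit behavior described here amounts to staging changes in memory and writing them out in one pass. The following sketch shows that pattern; the class `StagedLimits` and the `write` callback are hypothetical, and the real back end would be whatever quota mechanism the resource module wraps.

```python
from typing import Callable, Dict


class StagedLimits:
    """Hold pending quota changes until the administrator commits."""

    def __init__(self, live: Dict[str, int]) -> None:
        self.live = dict(live)            # limits as stored in the resource
        self.pending: Dict[str, int] = {}

    def stage(self, login: str, limit: int) -> None:
        self.pending[login] = limit       # nothing is written yet

    def preview(self) -> Dict[str, int]:
        # What the limits would look like if committed now.
        return {**self.live, **self.pending}

    def discard(self) -> None:
        self.pending.clear()

    def commit(self, write: Callable[[str, int], None]) -> int:
        """Push each pending change through `write`; return the count."""
        for login, limit in self.pending.items():
            write(login, limit)
        self.live.update(self.pending)
        count = len(self.pending)
        self.pending.clear()
        return count


written = []
staged = StagedLimits({"alice": 100, "bob": 100})
staged.stage("alice", 150)
before_commit = dict(staged.live)
committed = staged.commit(lambda login, limit: written.append((login, limit)))
```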


A Generalized Configuration Format 


Qualm uses an XML configuration file to deter- 
mine how data should be retrieved from a resource, 
labeled, and displayed. Figure 9 shows an example 
configuration file that fetches quota system data from 
a flat file and creates the four displays used for analy- 
sis in a previous section. 


The source tag indicates the data source; Qualm 
uses resource modules as data proxies. Currently, 
Qualm supports two types of resource modules: a 
resource module for interacting directly with the quota 
system and a resource module for reading data from a 
flat file. The args tag contains parameters which are 
passed to the module during initialization. The header
tags indicate how to label the data fields, which may 
be fields delimited by spaces in a flat file or a list of 
fields in memory. 


The plot tags tell Qualm which displays to use 
and how to generate displays when plotting resource 
data. The type attribute indicates what kind of plot to 
generate and corresponds to an existing plot object 
within Qualm’s graphing library. The ylabel and xlabel 
attributes tell Qualm how to label the axes. The sorted 
attribute indicates whether Qualm should use hierar- 
chical sorting when generating the plot. If the sorted 
attribute is set, Qualm uses the listed order of the 
fields as the hierarchical sort order. In the first
“DotFrequencyPlot” example, the hard limit is sorted
first, the soft limit is then sorted with respect to the
hard limit, and finally the block usage is sorted with
respect to the soft limit.
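
As an illustration of how a configuration like the one in Figure 9 might be read, the sketch below uses Python's standard `xml.etree.ElementTree` module to extract each plot's type and field order. It is not Qualm's own loader; the function `plot_specs`, the dictionary keys, and the trimmed-down configuration string are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

CONFIG = """<configuration app='qualm'>
  <resources>
    <resource type='quota' name='localhost' host='localhost'>
      <source type='module' module='File'>
        <plot type='DotFrequencyPlot' name='Soft Limit/Blocks'
              ylabel='Soft Limit (Blocks)' xlabel='Users' sorted='1'>
          <field>Soft Limit (Blocks)</field>
          <field>Blocks</field>
        </plot>
      </source>
    </resource>
  </resources>
</configuration>"""


def plot_specs(xml_text: str) -> List[Dict[str, object]]:
    """Return one description per plot tag found in the configuration."""
    root = ET.fromstring(xml_text)
    specs = []
    for plot in root.iter("plot"):
        fields = [(f.text or "").strip() for f in plot.findall("field")]
        specs.append({
            "type": plot.get("type"),
            "name": plot.get("name"),
            "fields": fields,
            # With sorted='1', the listed field order is the sort order.
            "sort_order": fields if plot.get("sorted") == "1" else [],
        })
    return specs


specs = plot_specs(CONFIG)
```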


Using this configuration system, a system admin- 
istrator can add displays or tweak existing displays to 
meet his needs by simply adding or modifying exist- 
ing plot tags. Qualm can also load and display data for 
multiple resources simultaneously by providing multi- 
ple resource tags. Adding a new resource tag adds 
another page in Qualm’s tabbed interface, making it
easy to look at two different resources from two differ- 
ent locations at once. 


Critique 


Differentiating between users who may belong in 
separate groups, but who still have the same quota 
requirements, is impossible under the current imple- 
mentation. In the EECS network, we often have stu- 
dents in multiple classes who are given the same disk 
quotas. Currently, these students fall under the same 
category, even though it might be convenient to keep 
them separated for non-quota reasons. 


A common example is having two students in 
different classes, a student in a data structures class 
and another in a programming languages class, with 
the same quota allocations. Under the current 
implementation, these students become indistinguishable.
Both students will lie on the same usage line in a
block usage plot, and worse, the relative order of their 
positions on that line is determined by the data sorting 
algorithm and thus may fluctuate! The lack of differ- 
entiation makes modifying quotas for students in a 
particular class difficult. 


The ideal solution to this problem lies in the 
implementation of the privilege mechanism described 
in the new quota management model, and thus a com- 
plete implementation of the model. Such a mechanism 
would need to be able to understand user classes and 
groupings. This would allow Qualm to better tailor its 
displays to reflect user grouping (perhaps via color cod- 
ing or a similar method), keep those groupings together 
during the sorting process, and allow the administrator 
to manipulate groupings directly via the display. Such a 
mechanism would give the administrator the finer grain 
of control needed to solve this issue. Ultimately, the 
system administrator can still use the traditional quota 
system to make changes to specific users. 


As with all systems that attempt to increase the 
efficiency of resource allocation, there is a question of 
how the system can be defeated. While the analysis with 
Qualm might suggest a more efficient quota configura- 
tion and greatly simplify making those changes, it does 
punish users who conserve disk space so that they might 
have that space immediately available when they truly 
need it. It encourages the mentality to “hoard now, so 
that I may hoard later,” a mentality that is well under- 
stood and often employed (sadly, successfully) in the 
corporate budget world. This problem, however, exists 
within current quota management systems, and its causes 
are mostly social factors. Still, because Qualm encour- 
ages a more dynamic management model, the influence 
of these social factors is more significant. The consolation
is: while Qualm emboldens change, it also makes it
easy to reverse change. 


Towards the Future 


As Qualm is still in the prototyping stage, the 
most pressing need in Qualm’s current stage of devel- 
opment is user feedback and suggestions. The next 
phases of Qualm’s development will involve working 
out the kinks and making Qualm a truly usable tool. 


Constraining the manipulation of quota levels to 
keep the area under the distribution curve constant has 
not yet been implemented at the time of writing. Such 
a mechanism would play an important role in keeping 
quota modifications in check; the current interface 
makes it very easy to lose track of the real scope of a 
change and make unreasonable changes to quota lim- 
its. However, it is crucial for the mechanism to sup- 
port the concept of over-subscription for inter-oper- 
ability; existing quota systems greatly over-subscribe 
storage. Over-subscription may also be desirable for 
providing users with a bit of breathing room, while 
keeping a preferable space distribution. 
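
Since this constraint is not yet implemented in Qualm, the following is only one possible way it could work, offered as a sketch. After the administrator pins some limits, the remaining limits are rescaled so that the total equals the storage size multiplied by an over-subscription factor. The function name, the `pinned` argument, and the factor are all hypothetical.

```python
from typing import Dict, Optional


def rescale_limits(limits: Dict[str, float], total_storage: float,
                   oversubscription: float = 1.0,
                   pinned: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Keep sum(limits) equal to total_storage * oversubscription by
    scaling every limit that the administrator has not pinned."""
    pinned = pinned or {}
    budget = total_storage * oversubscription - sum(pinned.values())
    free = {k: v for k, v in limits.items() if k not in pinned}
    free_total = sum(free.values())
    if budget < 0 or free_total <= 0:
        raise ValueError("pinned limits exceed the available budget")
    scale = budget / free_total
    result = {k: v * scale for k, v in free.items()}
    result.update(pinned)
    return result


current = {"alice": 100.0, "bob": 100.0, "carol": 200.0}
# Raise alice to 200 blocks; the others shrink to keep the total at 400.
rebalanced = rescale_limits(current, total_storage=400.0,
                            pinned={"alice": 200.0})
```

Setting `oversubscription` above 1.0 would preserve the shape of the distribution while allowing the sum of limits to exceed the physical storage, as existing quota systems commonly do.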

Much future work involves the full implementation of the
model: creating an implementation of the privilege
mechanism so that users can easily be grouped into classes
and different service levels, while keeping the distribution
of the resource separate. The privilege mechanism should
help solve the most pressing problem with the current Qualm
implementation, the inability to differentiate classes of
users, by providing the system administrator with a finer
grain of control over how users fit into the quota
distribution. Such a mechanism would also make it easier to
manage the exceptions that do not fit well into the new
quota management model by allowing for more classic quota
management functionality.


Finally, the mechanisms used in Qualm and their 
implementation are sufficiently flexible that Qualm can 
be used to analyze any type of network resource with 
similar properties to disk quota (such as bandwidth, for 
instance). Future research will explore other areas 
where the principles used in Qualm may be applied. 


Some Lessons Learned 


Despite being a fundamental component of modern
service networks, disk quota management has
changed little since its inception. The traditional focus 
on controlling user privilege has introduced problems 
of scalability, making analysis and global manipula- 
tion difficult. This paper advocates a new model of 
quota management that solves these scalability issues; 
by adopting a graphical approach that emphasizes 
resource share and distribution, the tasks of determin- 
ing where change is needed, assessing the impact of 
that change, and assessing the efficiency of the current 
quota scheme, are greatly simplified. 


Use of Qualm on the Tufts University EECS 
network has contributed significantly to an under- 
standing of the quota system, and what adjustments 
are necessary to make it optimally efficient. For the 
most part, quotas don’t affect the usage patterns of 
the majority of the students; most of the storage allo- 
cated to those students will probably never be used 
and that space might be better allocated to users in 
higher quota brackets, who tend to place a greater 
demand on the file system. By graphing patterns of 
usage over time, it was also easy to determine which 
users were restricted by their quotas and which users 
were over-subscribed. 


Finally, usage of Qualm has yielded some valu- 
able insight into the nature of our quota systems. It 
suggests that quota management by category is more 
effective than artificial user groupings. These cate- 
gories emerge from similarities found within existing 
user populations, and provide a better model for con- 
trolling global user quotas. Moreover, the strong 
resemblance of file size distribution to a Pareto distri- 
bution suggests that exploiting this knowledge may be 
the key to building an automated and optimal dynamic 
quota system in the future. 
Availability


Qualm is freely available, open source software. 
The project is currently under development and I am 
actively looking for people to get involved with the 
project, experiment with it, and provide feedback. 
Qualm is written in pure Python on top of wxPython 
[18] and makes use of the pyquota [16] module. To 
obtain Qualm, please visit: http://qualm.sourceforge.net.
Application Aware Management of
Internet Data Center Software 
ABSTRACT


We have built a comprehensive solution to address the management aspects of deployment and 
analysis of applications in Internet Data Centers. Our work was motivated by the high total cost of 
ownership of operating such centers, largely due to the variety of applications and their distinctive 
management requirements. We have chosen an approach that encapsulates application specific 
knowledge (is application aware) and deployed it in a number of corporate Internet Data Centers. 
Operations staff found substantial cost reduction in managing applications using our approach. 


Introduction 


A corporate Internet Data Center consists mostly 
of Web server farms, application server farms, and 
database servers. Frequently there is a heterogeneous 
server environment running Windows 2000, Solaris, 
Linux, and AIX platforms, often due to corporate merg- 
ers and acquisitions. Acquired third-party software can 
include (1) Web servers such as Apache, iPlanet 
(SunONE), Microsoft IIS, (2) application servers, such 
as BEA WebLogic, IBM WebSphere, Microsoft MTS, 
iPlanet (SunONE), and (3) database servers such as
Oracle and Microsoft SQL servers. In addition, enter- 
prises have a large quantity of in-house developed soft- 
ware (J2EE applications, ASP/JSP pages, COM(+) 
components, etc.) which need to be deployed and con- 
figured on top of the third party software. 


There has been excellent prior work on infras- 
tructure deployment and management (see the pio- 
neering paper [TH98] and references therein, such as 
[R97], or books, such as [LH02], [B00]). These solu- 
tions (and some of their tools, e.g., JumpStart, Kick- 
Start, Ghost, SUP, etc.) mostly focus on more generic, 
lower level aspects of the infrastructure, such as OS 
and standard network servers (DNS, mail, etc.). We 
advocate building on these existing solutions and at 
the same time creating new management technologies, 
which are application aware. By this we mean that 
instead of operating only on the level of files and 
directories, such solutions capture knowledge about an 
application such as its configuration methods, require- 
ments, and more. Operations personnel can use such 
knowledge to define methods to deploy, install, 
upgrade, and start applications. Once defined, these 
methods can be executed repeatedly and reliably 
through a “push” method or through a “pull” method
(what might be done today using tools such as “rdist”
or “cfengine” [CF02], respectively).

It is important to note that the approach of appli- 
cation-aware management technologies extends well 
beyond the realm of the Web applications in Internet 
Data Centers. However, for the sake of concreteness
and because of the importance of Web applications, 
this paper focuses on our efforts to create solutions for 
Internet Data Centers. 


In the next section, we introduce Application 
Management as an emerging and important field of 
system administration and then motivate our approach 
of application-aware management. Subsequently, we 
present the required building blocks of our approach 
and show how they interact with each other to form a
comprehensive solution. The next section shows some 
of the technologies needed to realize the building 
blocks. After that, we quantify some of the benefits 
seen by operations people using application-aware 
technology and finally conclude. 


Importance of Web Application Management 


In every industry, business is moving online. After 
migrating business information to databases in the eight- 
ies and setting up e-commerce applications in the 
nineties, organizations of all sizes are relying on web 
applications for core business operations. With the advent 
of Web services, this trend will only be reinforced. 


Application management today operates on file, 
directory, and configuration parameter levels. As most 
operations managers can attest, application changes are 
frequently hectic ordeals, involving custom-built scripts 
(sometimes written only moments before they are first 
used), best guesses about hardware and software con- 
figurations, remote locations, late hours, and over- 
worked staff. Changes can involve complex combina- 
tions of commercial software, in-house applications, 
and custom scripts. Operations staff is responsible for 
understanding the requirements for all these compo- 
nents, deploying them quickly and flawlessly, and 
remembering what they have done at a detailed level. 


Apart from the actual deployment, application 
management includes other substantial challenges. It 
is often important to detect what has changed and by 
whom in deployed applications (e.g., the administrator 
on duty at the time of an emergency made a quick fix 
to some Web server configuration file/database, which
is not consistent with the overall policy). Similarly, 
operations staff needs to quickly pinpoint deployment 
and configuration differences between two servers.
Severe reliability problems often lead operations staff
to undo and roll back application deployments (we
note that not every application can be cleanly rolled
back due to its possible side effects, such as changing
the OS state). 
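
As a small illustration of the comparison task mentioned above, the sketch below reports the configuration keys whose values differ between two servers. It is not drawn from the system described in this paper; the function name, the flat key/value representation, and the sample settings are assumptions.

```python
from typing import Dict, Tuple

_MISSING = "<missing>"   # placeholder for a key absent on one server


def config_diff(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each differing key to its (value on a, value on b) pair."""
    diff = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, _MISSING)
        right = b.get(key, _MISSING)
        if left != right:
            diff[key] = (left, right)
    return diff


web1 = {"MaxClients": "150", "KeepAlive": "On", "Listen": "80"}
web2 = {"MaxClients": "256", "KeepAlive": "On"}
differences = config_diff(web1, web2)
```

A file-level comparison like this says nothing about why the values differ, which is part of the motivation for capturing application-specific knowledge.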


Cost of Managing Web Applications 


Below are the summary points of a representa- 
tive Internet Data Center environment. The associated 
cost of managing application deployments and analy- 
sis in such an environment is discussed later. 

• 120 servers, 26 applications.
• Applications are all running on top of Apache Web servers and either WebLogic or WebSphere application servers.
• Collecting application changes and deploying them once a week.
• IBM consulting project to document manual change processes did not result in any improvements in quality or in cost.
• Costly, time-consuming errors, such as running out of disk space during an application deployment or forgetting to add required database tables during an n-tier application update.
• Pain manifestation: “My team of seven spent 16 hours on one WebSphere deployment.”


Motivating Application Awareness 


Here we present some typical workflows involving
applications on both the UNIX and Windows platforms.


J2EE Applications 


J2EE application servers, such as WebLogic,
WebSphere, and iPlanet (SunONE) implement their
own “logical topology” on top of the network of
physical server machines. A “server instance” is a
software component that makes one or more J2EE 
applications available on a physical machine. Each 
physical machine may host one or more server 
instances. Each product has its own way of creating, 
grouping, and managing its server instances. 


J2EE applications (e.g., Enterprise JavaBeans,
EJBs) implement the core business logic (second tier
in an n-tier architecture) of most Web-enabled applica- 
tions. These applications are typically packaged in 
archive formats, such as EAR (Enterprise Archive) or 
WAR (Web Archive). These archives contain configu- 
ration information in XML. Each vendor extends the 
basic XML configuration, affecting the way the J2EE 
application is deployed to the application server. Oper- 
ations personnel often get these archives from the 
development organization and might have to open the 
archive to edit configuration values. The deployment 
of these applications to a server instance always has to 
be effected via the responsible administrative console 
server. Details of the deployment commands differ 
among vendors and even among versions from the
same vendor. For example, the configuration informa- 
tion for each J2EE application running on WebSphere 
is stored in a centralized database, which is read and 
written by the administrative console during each 
deployment, upgrade, or other change. 


Closely coupled with J2EE business logic are the 
Web applications, the first tier in an n-tier architecture. 
These applications are deployed onto Web servers 
(Apache, iPlanet, etc.) and contain JSP pages, static 
content, and more. They are often deployed together 
with the second tier as a single logical software com- 
ponent, forming cross-machine dependencies. Further- 
more, the Web servers need to be configured to con- 
nect to the appropriate application server instances. 


Windows Applications 


On the Windows platform, IIS is the dominant 
Web server platform and MTS is the dominant plat- 
form for the business logic applications. Deployment 
of applications (ASP pages, ISAPI filters, etc.) onto 
IIS (versions 5.x) requires each configuration to be
reflected in the “metabase,” a registry-like database
resident on each Windows 2000 server. Any configu- 
ration information about IIS itself also has to be 
reflected in the metabase. On the Windows .Net server 
and IIS version 6.x, the metabase is realized as an 
XML file. Business logic and transactional compo- 
nents are typically packaged as COM or COM+ com- 
ponents. Deployments onto MTS require the registra- 
tion of these components with the Windows registry 
database. Any Web or business logic application might 
require additional manipulation of the registry 
database. Within the .Net framework, applications are 
packaged as assemblies, which have yet another 
deployment philosophy. 

Database Dependencies 


Most applications require access to database 
servers running Oracle, SQL, or similar. While 
deploying and configuring such servers might not be a 
frequent operation (relative to changes to applica- 
tions), the deployment of an application typically does 
require some manipulation of the database (e.g., 
tables, stored procedures). 


Need for Application Awareness 


Given the description of Internet Data Center 
software above and the illustration of Figure 1, it 
becomes evident that deploying and managing applications
requires a lot of application-specific knowledge.
This is unlike basic infrastructure management
of UNIX-based machines, which has been pioneered 
and described in [TH98] and the corresponding tools 
for UNIX or Windows (JumpStart, Ghost, isconf 
[TH98], cfengine [CF02], etc.). There is an unmet 
need for technology and solutions to automate the 
deployment and management process beyond host and 
OS management, and beyond pushing or pulling files 
and directories of applications. 
Application Aware Management


Application aware management requires technol- 
ogy that can (1) capture knowledge about an applica- 
tion, such as its installation, registration, configuration 
methods and its dependency management and (2) 
automate processes based on this knowledge. In the 
following, we present some of the key components of 
such an approach, building upon known technologies 
of [TH98], such as central version control, Gold 
Server, and basic server (host and OS) set-up tools. 
See also Figure 2, which summarizes the building
blocks described below, some of the building blocks
introduced in [TH98], and their dependencies. Some
of the building blocks in [TH98] have been merged
under “Host&OS” management, which is arguably
a precondition for application management as a
whole. The building blocks above the fat line were
introduced in [TH98]; we created the blocks below
that line. Lines between building blocks denote that
the block below depends on the block above being
realized in the system. The system we built includes
all building blocks except the “Host&OS” block.

• An Application Model is a data-driven representation
  of an application (including executables and
  configuration files, content, etc.) and associated
  methods to deploy, configure, and analyze the
  application. A model has to be reusable, so that it
  can be applied to different server environments
  (e.g., staging vs. production), which might require
  different deployment or configuration choices.
  Rather than capturing “hard-coded” configuration
  settings of the application, the model contains
  variables for such settings, which the user can
  instantiate during the deployment process.

• The Model Builder automates the creation of
  application models. It captures deployed applications
  from a Baseline Server (e.g., a machine in the QA
  environment), checks them into a central
  version-controlled repository (“Gold Server”
  approach), and at the same time creates a base model
  of the application. The model builder associates the
  type of the captured application (e.g., J2EE
  application on WebSphere, Web application on IIS)
  with an appropriate base model, including methods for
  deployment, analysis, and discovery (see subsequent
  building blocks). The model builder then allows
  customizing the base model by adding dependencies and
  configuration customizations (see subsequent building
  blocks) to the model. An illustration of the workflow
  enabled by this building block is shown later.

• The Deployment Manager automates the deployment
  process of an application end-to-end. For each
  checked-in and modeled application, this process
  assures that the application will be correctly
  installed on the desired server machines. In other
  words, the deployment management module forms the
  runtime system for the modeled methods.

• The Configuration Manager determines the desired
  configuration of an application according to the
  environment (e.g., number of CPUs, database
  connector, and thread-pool of the target server). It
  then generates the configuration by setting values in
  appropriate text files, XML files, or modeled methods
  of the application. In this way, the configuration
  manager can even write to database-like structures on
  the target servers, such as Windows registry, IIS
  metabase, WebSphere data stores, etc.

• The Dependency Manager ensures that all modeled
  requirements for a successful deployment are met
  (including across different machines). This occurs
  before the deployment manager makes any changes to
  the target server.

• Automated Analysis pinpoints configuration, version,
  and other differences between data center servers.
  This detects the “out-of-band” changes that are
  difficult to suppress in most Internet data centers,
  such as when an operations person changes a
  configuration value not using any sanctioned tools,
  but rather by hand, for example during an emergency
  server recovery procedure.

• Application Discovery collects information on what
  applications are already deployed on which server in
  the Internet data center. The application model
  guides the discovery process, as it knows what
  features indicate the presence of an application on a
  server. See [MDEGGH00] for a good introduction to
  model-driven application discovery.

Figure 3 summarizes how the above components
map onto the deployment and analysis process.

Figure 1: The deployment and configuration of web applications is a complex process, where most steps are
application dependent, meaning they differ from application to application (“QA” = quality assurance).

Figure 2: Building blocks for application management. (Blocks shown: Version Control, Gold Server,
Host & OS, Model Builder, Application Model, Deployment, Analysis, Dependency, Discovery, Configuration.)

Figure 4 illustrates how these technologies fit
into a system solution (which we call “CenterRun”).
The architecture consists of a master server and
remote agents. This solution offers a centralized
console to the operations personnel. From this console
(command line or Web GUI), applications can be
captured from Baseline Servers, application models can
be created, and deployments can be executed as
captured in the application model. Every action is logged
and archived. Installed applications can be analyzed
and compared across servers. Below we describe the
workflow as offered by the centralized console. We
use two concrete and very different applications, a
J2EE application on WebLogic (following the sample
application “petstore,” one of Sun’s Java blueprint
applications) and an n-tier Web application on Windows
(following the Microsoft sample application
“FMStocks,” see http://www.fmstocks.com).


Workflows Enabled by Application Awareness


Workflow 1: Deploy a J2EE Application on WebLogic

1. The Model Builder captures the J2EE application
   on a Baseline Server:
   a. Recognizes the application as a J2EE
      application on WebLogic.
   b. Creates an application model, capturing
      and describing all the relevant files and
      archives, such as EAR and WAR.
   c. Checks these resources into the master
      server’s repository, where they are versioned.
   d. Captures the relevant configuration information
      from the WebLogic XML file, “config.xml.”
   e. Uses predefined models to add all the necessary
      methods to install and uninstall the application
      to the newly created application model.

2. The Dependency Engine executes checks, such
   as whether WebLogic 6.0 or higher is actually
   installed and running, and whether the corresponding
   Web servers are configured to connect to
   the WebLogic server instances. It does so by
   querying the remote agent on the target servers.

3. The Deployment Engine parses the modeled
   methods for the J2EE application. It then
   understands which server is the target of the
   deployment (server hosting the administrative
   console) and which command-line calls need to
   be made to the administrative console. It transmits
   the resources to the agent on the target and
   has that agent execute the command-line calls.

4. The Configuration Engine determines the
   configuration values of the J2EE application. For
   example, the path of the application’s home
   directory on the WebLogic administrative console
   depends on the WebLogic domain. This
   install path is modeled as a variable. The
   Configuration Engine generates the value for this
   variable by examining the configuration state of
   the previously installed J2EE server instance.


It is worthwhile noting that Step 1 in the above
workflow is typically executed once for each application.
As a result, the application and the model are
stored and version-controlled on the master server.
Steps 2-4 are typically executed many times for
many different WebLogic target servers. Also, it is
very likely that different people within the operations
staff execute Step 1 and Steps 2-4. No knowledge
about Step 1 is required to kick off the sequence of
automated Steps 2-4.


Workflow 2: Compare Two Deployed Instances of the
Above J2EE Application

1. The Analysis Engine parses the analysis methods
   of the model for the J2EE application,
   which guide it to identify all the relevant
   configuration settings of the application in the
   config.xml files of both servers. It then transforms
   these two files into two new, smaller XML
   files, containing only this relevant information.

2. The Analysis Engine parses this captured information,
   compares the two XML files to each
   other and then presents any differences in a
   structured way to the user (e.g., the full name of
   the WebLogic parameter is presented with each
   differing configuration value).

It is worthwhile noting that the steps in the above
workflow use the methods created in Step 1 of Workflow 1.
The operations person does not need any
knowledge about Step 1 of Workflow 1 in order to
execute the above two steps.

Figure 3: Application aware components can automate the configuration and deployment process from end-to-end,
using appropriate application-specific knowledge for each step.


Workflow 3: Deploy an N-Tier Web Application on
Windows, Consisting of an IIS Virtual Directory,
COM+ Components, and a Database

1. Guided by the user, the model builder captures
   the Web application on a Baseline Server:
   a. Recognizes the application as a Web
      application on IIS.
   b. Creates an application model of the virtual
      directory, capturing and describing all the
      relevant content, ASP pages, and ISAPI
      filters. Figure 5 shows the screen, which
      lets a user select a virtual directory from
      the Baseline Server machine (which in
      Figure 5 is called “win_qa”).
   c. Creates an application model for the
      COM+ components.
   d. Creates an application model for the SQL
      scripts.
   e. Checks the resources of b), c) and d) into
      the master server’s repository, where they
      are versioned.
   f. Captures the relevant configuration information
      from the IIS metabase database and
      the COM+ catalog and stores them in
      XML format.
   g. Uses predefined models for each resource
      type (virtual directory, COM+ component
      and SQL script) to add to the newly created
      model all the necessary steps to install
      and uninstall the application. Figure 6
      shows the resources of the completely
      checked-in FMStocks application. The
      selection of the virtual directory in Figure
      5 resulted in two resources, the virtual
      directory tree (FMStocks) containing content,
      ASP pages, etc. and the corresponding
      configuration data (FMStocks.xml),
      which is a resource containing the relevant
      part of the IIS metabase in XML format.
      Each resource is listed together with its
      type. As we will see in the next section,
      this type determines how the application
      model is built.

2. The Dependency Engine executes checks, such
   as whether IIS 5.1 is actually installed and running
   on the target server or a global ISAPI filter
   implementing application security is registered
   in the metabase. It does so by querying the
   remote agent on the target server.

Figure 4: The Master Server (our version of a Gold Server) connects to Remote Agents in different data centers that
reside on each managed server.


3. The Deployment Engine parses the modeled
   methods for the IIS Web application. It then
   understands how to install the files and ISAPI
   filters on the target IIS server and which IIS
   services to shut down at the beginning and restart
   at the end of the deployment. It also understands
   how to register the COM+ components
   and how and on which server to run the SQL
   scripts. It transmits the resources to the agent
   on the target server and has that agent execute
   the installation by making calls into the ADSI
   API for IIS and COM+ API. The agent also
   runs the SQL scripts with the appropriate SQL
   server as target.

4. The Configuration Engine inserts all the captured
   configuration data from the XML files into the
   metabase of IIS on the target server and into the
   COM+ catalog. It executes the configuration by
   making calls into the ADSI and COM+ APIs.
Figure 5: Capture of the IIS virtual directory for FMStocks.


Workflow 4: Analyze the Above N-Tier Web Application
on Windows for “Out-of-band Changes”

1. The Analysis Engine parses the analysis methods
   of the model for the IIS virtual directory,
   which guide it to do the following: (1) identify
   all the relevant configuration settings of the
   given application in the metabase of the IIS
   server and then extract these settings into an
   XML file; (2) capture all file metadata from the
   relevant virtual directories and all local ISAPI
   filter version and configuration data of the
   given application.

2. The Analysis Engine parses the analysis methods
   of the model for the COM+ components,
   which guide it to query the COM+ catalog and
   capture the resulting configuration settings.

3. The Analysis Engine parses this captured information,
   compares it to the corresponding information
   in the master server’s repository (where
   the information relevant at the time of the last
   deployment is stored), and then presents any
   differences in a structured way to the user (e.g.,
   the full name of the metabase field is presented
   with each difference in the Web application
   configuration).
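
The comparison step shared by Workflows 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CenterRun implementation: the XML layout of the captured data, the sample values, and the function names flatten and diff_settings are all hypothetical. The sketch flattens each captured XML document into a map from a full setting name to its value and reports every name whose value differs, which mirrors the idea of presenting the full parameter name with each difference.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Map 'path@attribute' -> value for every attribute in the document."""
    settings = {}

    def walk(elem, path):
        # Use the Name attribute, if present, to tell sibling elements apart.
        label = elem.tag
        if "Name" in elem.attrib:
            label += "[" + elem.attrib["Name"] + "]"
        here = path + "/" + label
        for key, value in elem.attrib.items():
            settings[here + "@" + key] = value
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return settings


def diff_settings(expected_xml, actual_xml):
    """Return sorted (full_name, expected, actual) tuples for differing settings."""
    expected, actual = flatten(expected_xml), flatten(actual_xml)
    names = sorted(set(expected) | set(actual))
    return [(n, expected.get(n), actual.get(n))
            for n in names if expected.get(n) != actual.get(n)]


# Hypothetical captured configuration: repository copy vs. live server.
REPOSITORY = '<Domain Name="petstore"><Server Name="s1" ListenPort="7001"/></Domain>'
LIVE = '<Domain Name="petstore"><Server Name="s1" ListenPort="7002"/></Domain>'

if __name__ == "__main__":
    for name, want, got in diff_settings(REPOSITORY, LIVE):
        print(name, "expected", want, "found", got)
```

A setting that exists on only one side is reported with None for the missing value, so added or deleted settings show up alongside changed ones.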


Application Aware Technology 


In this section, we discuss some of the technol- 
ogy behind the building blocks and functions pre- 
sented in the last section. 


Application Model and Infrastructure 


A model needs to capture all aspects of an appli- 
cation: the software features (directories, files, binaries, 
content, etc.) and the execution steps to deploy, config- 
ure, discover, and analyze the application. The model 
also needs to express the relationships among objects of 
interest, such as grouping (e.g., software features into 
an application, target servers into clusters) and
dependencies of one application on other applications
being deployed, or dependencies on OS environments 
on the target server. Consequently, the model is respon- 
sible for all of the capturing, storing, and manipulating 
of application knowledge in the system. At the same 
time, for a user (system administrator, operations per- 
sonnel), the model is simply the means to an end, 
which is the automation of processes. That is why we 
decided that for common cases, the modeling task 
should be done by the system. The workflow, intro- 
duced in the previous section, allows a user to simply 
pick resources from a Baseline Server, group them into 
a modeled and checked-in application and then deploy 
such application with the “click of a button.” In order 
to achieve such a workflow, we put the following 
infrastructure in place: 
• A Component is a modeled application object,
  consisting of all the resources and methods
  (install, uninstall, discover, analyze) of an
  application.
• A Resource Type is a first class object in the
  system. Each time a user checks a resource into
  the repository, a resource type name (COM_PLUS,
  WAR_WebLogic, IIS_Web_Site, etc.) is
  associated with the resource. The resource type
  associates the corresponding type name with a
  set of behaviors for deployment, configuration,
  discovery, and analysis.
• A Resource Handler defines the actual behavior.
  It captures the methods to install, uninstall,
  discover, and analyze resources of a given type.
  A resource handler is implemented as a component
  itself. These components (“system components”)
  come bundled with the system, as
  opposed to application components built by the
  user. A system component might have its own
  resources, which are implementations (scripts,
  Java classes) of modeled methods for the supported
  resource type. Its methods are accessible
  by other components being deployed onto the
  same target server, and when calling these methods
  parameters can be passed. Consequently,
  the resource type really associates a resource
  type name with a resource handler. These system
  components are deployed onto target
  servers transparently to the user, so that they
  are available to application components at the
  time of a user initiated deployment or analysis.
• The Model Language defines the syntax in
  which each component is expressed. We chose
  XML, which is sufficiently readable so that
  advanced users can manipulate the model
  directly if needed for customization, rather than
  just using the model builder.

Figure 6: FMStocks application is checked-in and modeled.


In the following we illustrate these concepts with 
some model fragments. 


First, we show a fragment of the handler (see
Listing 1) for resources of type COM+. This is a system
component, bundled with the system and thus
never has to be manipulated by the user. As depicted
below, it consists of a few Windows Scripting Host
scripts, which are bundled in a resource called
“ComPlusScripts.” Those scripts are deployed to each target
server ahead of time, transparently to the user. The
fragment only shows one modeled method, which is
the “install” method, responsible for installing any
resource of type COM+. This method takes an input
parameter (rsrcDescription), which determines the
COM+ object for the handler. The method then
invokes one of the WSH scripts (Complus.wsf) via the
cscript command shell, passing the input parameter as
argument.


<?xml version="1.0" encoding="UTF-8" ?>
<component name="complusHandler" description="Installs com+ objects">
  <resourceList defaultInstallPath=":[install_path]">
    <resource installName="complus" resourceName="ComPlusScripts" />
  </resourceList>
  <controlList>
    <control name="install" description="Installs a com+ object">
      <paramList>
        <param name="rsrcDescription" />
      </paramList>
      <execNative>
        <exec cmd="cscript">
          <arg value="/Job:comPlusinstallApp" />
          <arg value=":[install_path]\complus\ComPlus.wsf" />
          <arg value="&quot;:[rsrcDescription]&quot;" />
        </exec>
      </execNative>
    </control>
  </controlList>
</component>

Listing 1: Fragment of the handler for COM+ resources.


Next, we show an application component fragment,
which corresponds to the FMStocks sample
application, mentioned in the previous section. The
model builder generates this component during the
workflow steps 1a-1g, as described in the previous
section.


The (complete) component contains a reference
to each resource selected by the user in Steps 1b-1d of
the workflow. The first two resources (FMStocks2000
and FMStocks2000.xml) are the result of the user
selecting the virtual directory. The first resource represents
all the content and ASP files, whereas the second
resource represents the configuration data of the virtual
directory on IIS, as it was set on the Baseline
Server. The model builder transparently exports all
relevant configuration data from the IIS metabase on
the Baseline Server into the FMStocks2000.xml file.
The third resource represents a COM+ resource. The
model builder transparently exports the COM+ object
from the Baseline Server into an MSI (Windows
Installer) file.


The installList XML block (see Listing 2) represents
the installation method. The deployResources tag
causes, for each resource, a call to the appropriate
resource handler installation method. For the COM+ 
resource, this causes the installation method in the pre- 
vious model fragment to be called, with parameters set 
to appropriate values obtained when the resource was 
captured. The installList XML block contains one step 
before and one step after the deployResources step. 
These steps stop and re-start IIS on the target server. 
They are calls to another system component, which 
models Windows services. The model builder gener- 
ates these calls explicitly, rather than implicitly 
through deployResources. These two steps are not 
attached to the installation of a single resource, but 
rather represent global behavior of the deployment of 
an n-tier Windows application. 
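
The expansion of an installList into an ordered sequence of handler calls can be sketched as follows. This Python fragment is an illustration only, not the system's actual runtime: the function name expand_install_list, the tables RESOURCE_TYPES and HANDLERS, the type name IIS_VDIR, and the handler name iisVdirHandler are hypothetical. Only complusHandler, servicesHandler, COM_PLUS, and the XML element names are taken from the listings and text above. The component is abbreviated to two resources.

```python
import xml.etree.ElementTree as ET

COMPONENT = """
<component name="fmstocks">
  <resourceList>
    <resource installName="FMStocks2000" resourceName="FMStocks2000" />
    <resource installName="FMStocks2000Core.msi" resourceName="FMStocks2000Core.msi" />
  </resourceList>
  <installList>
    <controlService actionName="stop" componentName="servicesHandler">
      <argList><arg name="serviceName" value="IISADMIN" /></argList>
    </controlService>
    <deployResources />
    <controlService actionName="start" componentName="servicesHandler">
      <argList><arg name="serviceName" value="W3SVC" /></argList>
    </controlService>
  </installList>
</component>
"""

# Hypothetical: resource type recorded at check-in, and the handler per type.
RESOURCE_TYPES = {"FMStocks2000": "IIS_VDIR", "FMStocks2000Core.msi": "COM_PLUS"}
HANDLERS = {"IIS_VDIR": "iisVdirHandler", "COM_PLUS": "complusHandler"}


def expand_install_list(component_xml):
    """Return the ordered (handler, method, argument) calls for an install."""
    root = ET.fromstring(component_xml)
    resources = [r.get("resourceName") for r in root.find("resourceList")]
    steps = []
    for step in root.find("installList"):
        if step.tag == "controlService":
            # Explicit call into the component that models Windows services.
            service = step.find("argList/arg").get("value")
            steps.append((step.get("componentName"), step.get("actionName"), service))
        elif step.tag == "deployResources":
            # Implicit: one handler "install" call per resource, in list order.
            for name in resources:
                steps.append((HANDLERS[RESOURCE_TYPES[name]], "install", name))
    return steps


if __name__ == "__main__":
    for call in expand_install_list(COMPONENT):
        print(call)
```

The sketch keeps the distinction drawn above: service control steps are explicit entries in the list, while per-resource installation is derived from the resource type of each checked-in resource.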


In the description above, we introduced our mod- 
eling infrastructure. We largely created our own solu- 
tion. Modeling frameworks in the application manage- 
ment space do exist, among them the following two: 

• CIM: The common information model
  [BSTWW00] is a hierarchical, object-oriented
  modeling paradigm with relationship capabilities.
  The core schema of CIM contains objects modeling
  both basic notions of system management
  and their relationships. There is an extension
  schema for applications, containing objects modeling
  (1) basic notions of software products, features,
  elements, and actions on these objects and (2)
  their relationships. CIM has been used to model
  a large part of the Windows platform (the CIM
  derived model is called WMI) and the Solaris
  platform (see Solaris WBEM at [DMTF]). CIM
  is also being investigated for run-time application
  management in [KKS01].

• JSR77 ([JSR77]): A Java Specification Request,
  which proposes a standard management model
  for exposing and accessing the management
  information, operations, and parameters of the
  Java 2 Platform, Enterprise Edition components.
  This proposal covers modeling the basic J2EE
  notions of J2EEServer, J2EEModule, EJBModule,
  WebModule, etc., their relationships, and
  event management. The proposal also discusses
  mappings of this (specific) model into (the more
  generic) CIM. JSR77 encapsulates a lot of
  knowledge about the J2EE world. As pointed
  out in [FK02], JSR77 lacks expressiveness for
  describing runtime entities and for representing
  versions and dependencies.


We found that the above models, while inspiring
in many ways, were not a good fit for us for an initial
implementation. While CIM is a very expressive and
extensible modeling framework, it comes with a relatively
complex implementation and representation
price. It also lacks two key ingredients for us: parameter
passing and configuration management. Parameter
passing appears to be required for implementing the 
notion of resource handler. Configuration management 
addresses the issue that the same application will be 
deployed with different configuration settings depend- 
ing on the environment and that a model must therefore 
allow such values to be generated dynamically at 
deployment time. See the subsequent section for more 
details. Furthermore, it is important that the model can 
be conveniently represented to advanced users, who
desire to customize the deployment or analysis behavior.
CIM does not offer a simple solution for this.


<?xml version="1.0" encoding="UTF-8" ?>
<component name="fmstocks" description="FMStocks sample application">
  <resourceList defaultInstallPath=":[install_path]">
    <resource installName="FMStocks2000"
              resourceName="FMStocks2000" />
    <resource installName="FMStocks.xml"
              resourceName="FMStocks.xml" />
    <resource installName="FMStocks2000Core.msi"
              resourceName="FMStocks2000Core.msi" />
  </resourceList>
  <installList>
    <controlService actionName="stop" componentName="servicesHandler">
      <argList>
        <arg name="serviceName" value="IISADMIN" />
      </argList>
    </controlService>
    <deployResources />
    <controlService actionName="start" componentName="servicesHandler">
      <argList>
        <arg name="serviceName" value="W3SVC" />
      </argList>
    </controlService>
  </installList>
</component>

Listing 2: The installation method.


Configuration Management


We decided that configuration management is an
area in which we needed to innovate. We designed an
extension to our model, which allows the inclusion of
variables, and an algorithm (see also Figure 7) which
runs at deploy time to instantiate these variables. We
also allowed users to replace actual values with
variables in configuration files associated with an
application. The algorithm determines the configura- 
tion data by looking at least at the following environ- 
ment characteristics: 

• The characteristics of the server machine (IP
  address, hostname, network connectivity, processors,
  etc.).

• The characteristics of the operating system of
  the server machine (OS type and version, patch
  level, etc.).

• The characteristics of other software already on
  the server.

• The characteristics chosen by a user (system
  administrator) at run-time of the deployment
  process.


The algorithm reads the model as its first input. 
Depending on the content of the model, the algorithm 
then determines the environment characteristics by col- 
lecting the relevant data from each target server, from 
models of other applications provisioned on the target 
server, and from user generated input. The algorithm 
uses the collected information to generate values for 
settings within configuration files of the application and 
to transform the methods of the model into concrete 
execution steps, to be executed on each server to install 
and deploy the application, taking into account all the 
characteristics collected as described above. 


Let us revisit the petstore application introduced
earlier (see Listing 3). Assume that petstore is to be
deployed onto WebLogic 6.1. The path of petstore’s
home directory on the WebLogic administrative
console depends on the WebLogic domain of the J2EE
server onto which petstore is deployed. The listing is a
fragment of an application component modeling this
fact. The varList XML block contains declarations of
variables used in this component. The domain_dir
variable is used to hold the value of the home directory
path. The declaration allows defining a default
value of a variable. In this case, the default value is a
path with a fixed prefix (/opt/bea/wlserver6.1/config)
and a suffix, which refers to another variable
domain_name in another application component called
WL61Server. That component models the WebLogic
server instance on the same target server and that variable
contains the domain name of the server. The algorithm
above instantiates the variable domain_dir by
examining the configuration state of the installed J2EE
WebLogic server.
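
The instantiation of such a variable can be sketched as a small recursive substitution. This Python fragment is a sketch under assumptions, not the algorithm as implemented: the :[name] placeholder form follows the listings, but the dotted form for referring to a variable in another component (:[WL61Server.domain_name]), the domain name value “mydomain,” and the function name resolve are hypothetical.

```python
import re

# Placeholders look like :[name]; this matches the form used in the listings.
PLACEHOLDER = re.compile(r":\[([^\]]+)\]")


def resolve(name, component, components, seen=()):
    """Resolve variable `name` of `component`, following references recursively."""
    if (component, name) in seen:
        raise ValueError("circular variable reference: %s.%s" % (component, name))
    template = components[component][name]

    def substitute(match):
        ref = match.group(1)
        # Hypothetical convention: "Other.var" refers to another component.
        target, _, var = ref.rpartition(".")
        return resolve(var, target or component, components,
                       seen + ((component, name),))

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical state: WL61Server models the installed WebLogic instance.
COMPONENTS = {
    "WL61Server": {"domain_name": "mydomain"},
    "petstore": {
        "domain_dir": "/opt/bea/wlserver6.1/config/:[WL61Server.domain_name]",
    },
}

if __name__ == "__main__":
    print(resolve("domain_dir", "petstore", COMPONENTS))
```

Because the default value of domain_dir is itself a template, the same model can be deployed to servers with different domains without editing the component.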


Deployment Management 


The Deployment Manager handles all aspects of
the deployment of a modeled application, including
efficient WAN distribution of potentially large
amounts of data and execution of the modeled deployment
methods on each target server. This is why we
decided to use a local cache (local distributor) on 
LANs, so that a single master and console can control 
multiple data centers. Another issue is security of the 
communication. However, none of these issues relate 
to the main theme of this paper, application awareness. 


From an implementation point of view, it is
worthwhile to note that JSR 88 (see [JSR88]) is a proposed
standard API for the deployment of J2EE
applications. The specification aims at defining the
contracts that enable solutions from multiple providers
to configure and deploy applications on any J2EE
platform (e.g., WebLogic, WebSphere, iPlanet, etc.). We
need to do deployment across platforms (UNIX, Windows)
and well beyond the scope of J2EE applications;
still, standardization of this form is a welcome
simplification for management tools like ours.

Figure 7: Configuration management algorithm. (Inputs: set of target servers, user input, and application
models of deployed applications. Output for each target server: instructions for deploying A and
application configuration for A.)


Dependency Management 


Dependency information is expressed within our
model as relationships among applications. For example,
a J2EE application might only be deployable onto
WebSphere V4 or newer; WebSphere itself might
require the JDK x.y already installed, which in turn
requires a certain patch-level of the Solaris OS. On the
Windows side, an IIS-based application might require
IIS V5 or newer, etc. In CIM, relationships are
expressed as objects themselves. We decided to model
relationships simply as object attributes, which is a
simpler implementation at the cost of some flexibility
(e.g., modeling a relationship in both directions) and
extensibility (e.g., adding a relationship without
changing the object in the relationship itself).
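
Modeling relationships as attributes can be illustrated with a short Python sketch. It is not the system's data model: the class Application, the function unmet_dependencies, the application name “orders,” and every version number below are made up for illustration. Each object carries its own list of requirements, and a check walks that list transitively against the inventory of a target server.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    name: str
    version: tuple
    # Relationships stored as a plain attribute: (required name, minimum version).
    requires: list = field(default_factory=list)


def unmet_dependencies(app, installed):
    """Walk `requires` transitively; return a list of human-readable problems."""
    problems = []
    for needed, minimum in app.requires:
        found = installed.get(needed)
        if found is None:
            problems.append("%s requires %s, which is not installed" % (app.name, needed))
        elif found.version < minimum:
            problems.append("%s requires %s >= %s" % (
                app.name, needed, ".".join(map(str, minimum))))
        else:
            problems.extend(unmet_dependencies(found, installed))
    return problems


# Hypothetical inventory of one target server; all version numbers are made up.
solaris = Application("Solaris", (8, 0))
jdk = Application("JDK", (1, 3), requires=[("Solaris", (8, 0))])
websphere = Application("WebSphere", (3, 5), requires=[("JDK", (1, 3))])
INSTALLED = {a.name: a for a in (solaris, jdk, websphere)}

app = Application("orders", (1, 0), requires=[("WebSphere", (4, 0))])

if __name__ == "__main__":
    print(unmet_dependencies(app, INSTALLED))
```

The trade-off named above is visible here: the relationship lives on the dependent object, so adding a new relationship means changing that object, and the reverse direction (which applications depend on WebSphere) has to be computed by scanning all objects.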


An application can consist of several software 
features. These features might get installed on separate 
machines (Web server, app server, console for app 
server). In addition, another set of machines (e.g., a
database server) might require configuration changes.
Our model captures these inter-machine dependencies
in the deployment methods and their target servers.
Such dependencies are a simplified form of the ones
considered in [EK01]. Our approach could therefore
benefit greatly from integration with a solution such
as the one presented in that paper.


Automated Analysis 


The remote agent analyzes the installed application
with parsers and tools that have application
knowledge. For example, in order to analyze
the configuration state of an IIS / WebSphere /
Weblogic application, the remote agent needs to (1)
determine which values are relevant in the
corresponding configuration store (metabase / centralized
database / XML store) and (2) export these values to
the master server. Analyzing Apache Web server
configuration data requires a parser that understands the
“httpd.conf” file.
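
To make the idea of an application-aware parser concrete, here is a minimal
Python sketch that pulls selected directives out of Apache-style configuration
text. It is an assumption-laden simplification and not the agent's actual
parser: it ignores section nesting, line continuations, and included files,
and it matches directive names case-sensitively. The function name and the
sample text are hypothetical.

```python
def parse_directives(text, wanted):
    """Collect the values of selected directives from Apache-style text."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and <Section> open/close tags.
        if not line or line.startswith("#") or line.startswith("<"):
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if name in wanted:
            found.setdefault(name, []).append(value)
    return found


sample = """
# sample configuration
ServerRoot "/usr/local/apache"
Listen 80
Listen 8080
<Directory "/var/www">
    Options Indexes
</Directory>
DocumentRoot "/var/www"
"""

print(parse_directives(sample, {"Listen", "DocumentRoot"}))
```

The resulting dictionary is the kind of reduced, relevant data that could then
be exported to the master server.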


We implement analysis in a fashion closely analogous
to deployment. Each resource handler has a
method that models the steps to obtain the
appropriate configuration data. Analysis consequently
implements the runtime environment for that model
aspect, which results in the master server obtaining all
the relevant data in a well-defined XML format, ready
to be analyzed.


<varList>
<var name="domain_dir"


Cost Benefits 


The organization depicted in the case study of 
the second section has been spending the following 
dollar amounts on application management per year: 

• $450K in staff cost for writing deployment
scripts, documentation on deployment methods
and best practices, etc.

• $1.1M in staff cost for executing manual
changes, deployments, etc. on the servers.

• $500K in staff cost for emergency response to
application failures, deployment failures, etc.

⇒ This organization spends roughly $2,050K on total
cost of ownership, or $17K on each of their 120
servers.


The organization has now adopted our technol- 
ogy in their extranet production environment. Using 
our technology, they were able to automate the appli- 
cation deployment process from end-to-end. System 
reliability has improved, both due to shorter mainte- 
nance windows and due to less unplanned down time. 
They were able to free up resources dedicated to appli- 
cation deployment, lowering operating costs. Specifi- 
cally, they have experienced the following benefits: 

1. Drastically reduced time to deploy applications 
(Apache, WebSphere, and Weblogic base 
infrastructure and J2EE applications residing 
on this base infrastructure) through our automa- 
tion technology (early indications show the 
quantitative gain to amount to an 80% reduc- 
tion). 

2. Eliminated most errors that typically occurred 
during deployment (where the remaining issues 
are mostly around process issues among the 
operations staff). 

3. Significantly reduced firefighting and server 
rebuilds by using our analysis technology to 
track “out-of-band” changes (i.e., changes to 
installed applications made ad-hoc by opera- 
tions personnel). 

4. Increased number of applications supported 
while eliminating reliance on external consul- 
tants. 


default="/opt/bea/wlserver6.1/config/:[component:WL61Server:domain_name]" />
</varList>
<resourceList defaultInstallPath=":[domain_dir]" >
<resource installName="petstore.ear" resourceName="wit-petstore/petStore61.ear"
installPath="applications" />
</resourceList>


Listing 3: Enhanced petstore application.
5. Leveraged reusability of the technology to
consistently deploy applications across five
different environments.

6. Automatically extracted application builds from 
Rational ClearCase into our version-controlled 
repository, reducing errors associated with 
application builds. 


As a result, they could reduce the management 
costs to the following amounts: 
• $50K in staff cost for authoring and fine-tuning
models for their applications.
• $260K in staff cost for running deployments,
upgrades, etc.
• $190K in staff cost for using analysis tools to
investigate failures, etc.
⇒ Cost of ownership has dropped to roughly
$500K per year, a 75 percent drop from the “before
picture,” totaling $4.2K per server.
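
The before and after figures quoted in this section can be cross-checked with
a few lines of arithmetic. The Python sketch below simply re-adds the
published line items (in thousands of dollars per year) and divides by the 120
servers; the dictionary keys are shorthand labels invented here.

```python
# Annual staff cost in $K, taken from the two lists in this section.
before = {"scripts_and_docs": 450, "manual_changes": 1100, "emergency_response": 500}
after = {"model_authoring": 50, "deployments": 260, "failure_analysis": 190}
servers = 120


def summarize(costs, servers):
    """Return (total cost, cost per server) for one cost breakdown."""
    total = sum(costs.values())
    return total, total / servers


total_before, per_server_before = summarize(before, servers)
total_after, per_server_after = summarize(after, servers)
drop_percent = 100 * (total_before - total_after) / total_before

print(total_before, round(per_server_before, 1))  # 2050 17.1
print(total_after, round(per_server_after, 1))    # 500 4.2
print(round(drop_percent, 1))                     # 75.6
```

The totals agree with the rounded values reported in the text ($2,050K and
$17K per server before, $500K and $4.2K per server after, roughly a 75 percent
drop).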


Conclusions 


In this paper, we have presented our management 
approach to deployment and analysis of Web applica- 
tions. We have argued that an effective solution needs 
to be application aware and have shown what technol- 
ogy makes up such a solution. Finally, we have 
described some quantitative implications on the total 
cost of ownership (TCO) of an Internet Data Center. 
We believe that application aware management is a 
new space, with a lot of potential for creating new and 
exciting technology and solutions. 
ABSTRACT


This paper presents the results of a proof-of-concept implementation of an on-going project 
to create a cost effective method to provide geographic distribution of critical portions of a data 
center along with methods to make the transition to these backup services quick and accurate. The 
project emphasizes data integrity over timeliness and prioritizes services to be offered at the 
remote site. The paper explores the tradeoff of using some common clustering techniques to 
distribute a backup system over a significant geographical area by relaxing the timing 
requirements of the cluster technologies at a cost of fidelity. 


The trade-off is that the fail-over node is not suitable for high availability use as some loss of 
data is expected and fail-over time is measured in minutes not in seconds. Asynchronous 
mirroring, exploitation of file commonality in file updates, IP Quality of Service and network 
efficiency mechanisms are enabling technologies used to provide a low bandwidth solution for the 
communications requirements. Exploitation of file commonality in file updates decreases the 
overall communications requirement. IP Quality of Service mechanisms are used to guarantee a 
minimum available bandwidth to ensure successful data updates. Traffic shaping in conjunction 
with asynchronous mirroring is used to provide an efficient use of network bandwidth. 


Traffic shaping allows a maximum bandwidth to be set, minimizing the impact on the existing
infrastructure and lowering the requirement for a service level agreement if shared media is
used. The resulting disaster recovery site allows off-line verification of disaster recovery
procedures and quick recovery of critical data center services; it is more cost effective than
a transactionally aware replication of the data center and more comprehensive than a commercial
data replication solution used exclusively for data vaulting. The paper concludes with a discussion
of the empirical results of a proof-of-concept implementation.


Introduction 


Often data centers are built as a distributed sys- 
tem with a main computing core consisting of multiple 
enterprise class servers and some form of high perfor- 
mance storage subsystems all connected by a high 
speed interconnect such as Gigabit Ethernet [1]. The 
storage subsystems are generally combinations of Net- 
work Attached Storage (NAS), direct connected stor- 
age or a Storage Area Network (SAN). Alvarado and 
Pandit [2] provide a high-level overview of NAS and 
SAN technologies and argue that these technologies 
are complementary and converging.


In order to increase the availability of such a sys- 
tem, the aspects of the systems’ reliability, availability, 
and serviceability (RAS) must be addressed. Reliabil- 
ity, availability and serviceability are system level 
characteristics, which are invariably interdependent. 
Redundancy is the way resources are made more 
available. This redundancy permits work to continue 
whenever one, or in some cases more than one, component
fails. Hardware fail-over and migration of software
services are means of making the transition between
redundant components more transparent. Computer 
system vendors generally address hardware and oper- 
ating system software reliability. For example, Sun 
Microsystems has advertised a guaranteed 99.95%
availability for its standalone Enterprise 10000 
Servers [3]. Serviceability implies that a failure can be 
identified so that a service action can be taken. The 
serviceability of a system obviously directly affects 
that system’s availability. A more subtle concern is the 
impact of increasing system reliability and redundancy 
through additional components. Each additional soft- 
ware or hardware component adds failure probabilities 
and thus any project to increase the availability of a 
system will involve a balance of reliability and ser- 
viceability as well. Network Appliance provides a 
good example of this tradeoff in the design of their 
highly available (HA) file servers [4]. 
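
The point that each added component contributes its own failure probability
can be illustrated with the standard textbook availability model. The Python
sketch below assumes independent failures, which real systems only
approximate; the 99.9% figure for the second component is a made-up value, and
only the 99.95% server figure comes from the text.

```python
from math import prod


def series(availabilities):
    """All components are required: the system is up only if every one is up."""
    return prod(availabilities)


def parallel(availabilities):
    """Redundant components: the system is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


server = 0.9995          # advertised standalone server availability
extra_component = 0.999  # hypothetical added software or hardware layer

print(series([server, extra_component]))  # lower than either component alone
print(parallel([server, server]))         # higher than a single server
```

Under these assumptions, chaining a required component lowers availability,
while a redundant pair raises it, which is the balance the paragraph above
describes.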


Disaster protection and catastrophic recovery tech- 
niques are not generally considered as part of a vendor 
HA solution, but the economic reasons which drive HA 
solutions [5] demand contingency planning in case of a 
catastrophic event. In short, HA protects the system; dis- 
aster recovery (DR) protects the organization. 


Recent technical capabilities, particularly in the
area of networking, have enabled commercial off-the-shelf
(COTS) clusters to emerge. This paper begins to
examine the same technologies and general techniques used
in COTS clusters for their feasibility as techniques to
provide geographically distributed systems appropriate
for use as remote disaster protection facilities at
reasonable cost. In this paper we define geographically
distributed in terms of limited communications bandwidth,
not a distance measurement.


The goal of this work is to outline a design that
provides a DR facility that can be made operational
quickly for critical functions, provides a means of
verifying DR plans and procedures, minimizes data loss
during the disaster, and provides the basis for the
reconstruction of the company’s computing base. The premise of
the paper is to explore the use of HA cluster
technologies to distribute a backup system over a significant
geographical area by relaxing the timing requirements
of the cluster technologies at a cost of fidelity.


The trade-off is that the fail-over node is not suit- 
able for HA usage as some loss of data is expected and 
fail-over time is measured in minutes not in seconds. 
Asynchronous mirroring, exploitation of file common- 
ality in file updates, IP Quality of Service (QoS) and 
network efficiency mechanisms are enabling technolo- 
gies used to provide a low bandwidth solution for the 
communications requirements. Exploitation of file 
commonality in file updates decreases the overall com- 
munications requirement. IP QoS mechanisms have 
been designed to add support for real-time traffic. 


The design presented takes the real-time require- 
ments out of the HA cluster but uses the QoS mecha- 
nisms to provide a minimum bandwidth to ensure suc- 
cessful updates. Traffic shaping in conjunction with 
asynchronous mirroring is used to provide an efficient 
use of network bandwidth. Traffic shaping allows a
maximum bandwidth to be set, minimizing the impact
on the existing infrastructure and lowering the
requirement for the service level agreement (SLA).


The next section outlines the approach used for 
providing a geographically distributed system to sup- 
port continued operation when a disaster has occurred. 
Then, DR, HA and IP QoS background is provided. 
The subsequent section provides details and results of 
a specific proof-of-concept study. Impacts of the DR 
elements on the production system are examined along 
with the issues found during the case study. The paper 
concludes with a summary and discussion of future 
work on this project. 


The Approach 


When developing a DR contingency plan, the 
restoration of critical operations of a data center takes 
priority. This section proposes a design for a “warm 
backup” site for such operations using existing com- 
munication channels or lower bandwidth commercial 
communication channels. This design is a compromise 
between the restoration from backup tapes and a fully 
synchronous DR facility. The third section offers fur- 
ther discussion of DR options. Tape restoration will be 
required for non-critical services. The backup site is 
also not kept synchronous with the primary but is syn- 
chronized at a point in the past through asynchronous 
mirroring. This solution is appropriate for
organizations that can survive some loss of availability along
with the potential loss of some data updates. The
case study is intended to evaluate the feasibility and
gain of the proposed DR solution. Specifically, it
evaluates the feasibility of deploying the DR system
supporting one terabyte of data with no more than
24 hours of lost data updates over an existing T1
(1.544 Mbps) line or, alternatively, over a small
portion of a shared WAN.


Diagram 1: Proposed DR architecture.
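
A back-of-the-envelope calculation helps frame the feasibility question for a
T1 link. The Python sketch below uses the nominal 1.544 Mbps line rate and a
decimal terabyte, and it ignores framing and protocol overhead, so the results
are optimistic upper bounds rather than measured values.

```python
T1_BPS = 1.544e6                 # nominal T1 line rate, bits per second
SECONDS_PER_DAY = 24 * 60 * 60
TERABYTE = 1e12                  # decimal terabyte, in bytes


def bytes_per_day(rate_bps, utilization=1.0):
    """Bytes that fit through a link in 24 hours at a given utilization."""
    return rate_bps * utilization * SECONDS_PER_DAY / 8


daily = bytes_per_day(T1_BPS)
print(round(daily / 1e9, 1))       # about 16.7 GB per day
print(round(TERABYTE / daily, 1))  # about 60 days for a full copy
```

Under these assumptions, a fully utilized T1 moves under 2 percent of a
terabyte per day. Keeping the backup within 24 hours of the primary therefore
depends on the daily volume of changed data staying below that figure, and an
initial full copy over the same link would take on the order of two months.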


Topology 


The topology of the data center is assumed to be a
centrally located facility of a few to tens of application,
file and database servers. Users interface to these
services through desktop clients.


The topology of the DR site is assumed to be one 
of each compatible server interconnected with its pri- 
mary. Thus, there is a consolidation at the DR site of 
critical services on many production servers to a sin- 
gle DR server. The data associated with each service 
must be available to the service’s server. All DR 
servers are interconnected over a LAN. The application
consolidation greatly reduces the cost of the DR
site. However, it complicates the network identity
fail-over, as discussed in
the Network Identity Fail-over section.


The production environment for the case study is 
comprised of data service provided via multiple Net- 
work Appliance file servers (known as filers) and 
application services provided by multiple Sun 
Microsystems Enterprise 5x00 servers and an Enter- 
prise 10000 server. At the DR site, the application ser- 
vices are being replicated on an Enterprise 5000 server 
and data services are being consolidated to a single 
Network Appliance F740 filer. Diagram 1 provides an 
overview of the target architecture. In the proof-of- 
concept a subset of applications and data are being 
tested on the local LAN to determine feasibility and 
SLA requirements for DR deployment. The proof-of- 
concept test environment is discussed further in the 
study results section. 


Fail-over 


A survey of commercial HA solutions (see the 
Background section) can be generalized as providing 
for the movement of three critical elements from the 
failed server(s) to the backup: the network identity, the 
data, and the set of processes associated with that data. 
Additionally, a service to monitor the health of each 
primary service, each backup server, and communica- 
tions between primary and backup servers is required. 
This service is known as a heartbeat. In HA solutions 
this fail-over is normally automated. The problem
at hand, however, is providing availability in the face
of a disaster. A disaster may be predicted, so that a
preventative fail-over can be initiated, or it may be
prudent to delay initiation of a fail-over until the
disaster has run its course. For this reason, a manually
initiated, automated fail-over process is used.


Heartbeat 


A mechanism to determine system and communi- 
cations liveliness is required, but the determination is 
not required continuously as it is for HA. The main 
issue for this DR site is to keep the data and processes in 
synchronization to the fidelity of the DR requirements. 
Fail-over does not rely on a heartbeat for initiation, and
synchronization occurs through asynchronous mirroring
or shadowing periodically, not continuously. Therefore,
the determination of system liveliness is required only 
before the initiation of a synchronization process. The 
heartbeat mechanism will need to be specific to the file 
service, mirroring software, and communication tech- 
nology used. If any errors occur, operational personnel 
need to be alerted of the problem, but there should be 
no impact to the production data center. 


In the case study, a Korn shell script was written 
to determine system liveliness. As described in the 
Data Migration subsection below, the remote mirror- 
ing occurs on a volume basis so to determine file sys- 
tem liveliness, prior to initiation of each volume syn- 
chronization, the primary and backup filer status is 
checked and logged via a series of remote status com- 
mands (e.g., from Solaris: rsh nacbac sysstat). The sta- 
tus of the primary and backup servers and communi- 
cations network liveliness is verified and logged by 
checking the respective network interfaces using vari- 
ous operating system supplied status utilities (e.g., 
from Solaris: ping, netstat). In the prototype, if any 
errors occur, e-mail is sent to the personnel involved 
in the test. In the final production system, alerting 
operational personnel should be integrated into the 
system and network management platform. 
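
The pre-synchronization check described above was implemented as a Korn shell
script in the case study. The Python sketch below is a different,
hypothetical rendering of the same idea: run a list of status commands, treat
a non-zero exit status as a failure, and raise an alert instead of touching
the production systems. The host name `primary-filer`, the function names, and
the use of exit status as the health signal are assumptions; depending on the
platform, commands such as `ping` may need a count or timeout option to exit
on their own.

```python
import subprocess


def command_ok(argv, timeout=30):
    """Run one status command; treat exit status 0 as 'alive'."""
    try:
        done = subprocess.run(argv, capture_output=True, timeout=timeout)
        return done.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def pre_sync_check(checks, run=command_ok, alert=print):
    """Return True only if every check passes; report failures via alert()."""
    failures = [label for label, argv in checks if not run(argv)]
    for label in failures:
        alert(f"liveness check failed: {label}")
    return not failures


# Placeholder checks; host names are illustrative only.
CHECKS = [
    ("primary filer reachable", ["ping", "primary-filer"]),
    ("backup filer reachable", ["ping", "nacbac"]),
]

# Demonstration with a stand-in runner, so nothing is executed on the network.
alerts = []
ok = pre_sync_check(CHECKS, run=lambda argv: argv[-1] != "nacbac", alert=alerts.append)
print(ok, alerts)
```

A volume synchronization would be started only when `pre_sync_check` returns
True; in production the `alert` callback would hand off to the system and
network management platform rather than printing or sending e-mail.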

Process Migration 

If the DR site is a mirror of the production data 
center then commercial shadowing software can be 
used to synchronize data, applications and system con- 
figurations. Since it was assumed that the DR site is 
not a mirror of the data center, services must be priori- 
tized into separate categories. Each category should 
have an increasing tolerance for unavailability. These 
services must be installed, configured and updated 
along with the primary servers in the data center. 


In the case study, only select services are 
installed on the DR servers and all services must be 
restarted at fail-over. This fail-over involves bringing 
the data online with read and write access and a reboot 
of the servers. 


Data Migration 


In a DR situation a copy of the data associated 
with the migrated services must be provided to the DR 
facility. The integrity of the data must be ensured along 
with the synchronization of the system as a whole. 
Commercially, data replication solutions provide a 
method of supplying and updating a copy of the data at 
an alternate site. Commercial Data Replication solutions 
are database, file system, OS or disk subsystem specific; 
thus, enterprises may be required to use multiple solu- 
tions to protect their critical data. The Gartner Research 
Note T-13-6012 [6] provides a table that differentiates 
24 products by the options they support. 


In the production environment for the case study,
a data replication solution must be provided for the
network attached storage and for the direct attached storage controlled
under the Solaris OS and Veritas volume management,
along with special considerations for Sybase and Oracle
data. The initial proof-of-concept is looking at a subset
of the production data environment, specifically the
network attached storage with Oracle database data.


Bandwidth Utilization 


For this project geographic distribution has been 
defined in terms of limited communications band- 
width, thus our design seeks to minimize communica- 
tion requirements. Three bandwidth limiting tech- 
niques and compromises are used. 


Our first compromise is with the heartbeat. The 
heartbeat, as previously discussed, is relaxed from a 
real-time or near real-time monitor to one that only 
requires activation upon a synchronization event. In 
the case study, this was once daily and the heartbeat’s 
impact is negligible. 

Our second compromise is with data replication.
The data is shadowed, not synchronously mirrored.
This allows the use of a network efficiency
mechanism known as traffic shaping. See Diagram 2.


The objective of traffic shaping is to create a 
packet flow that conforms to the specified traffic 
descriptors. Shaping introduces a queuing of network 
traffic, which is transmitted at a fixed rate resulting in 
high network utilization. Shaping may change the 
characteristics of the data flow, by intentionally delay- 
ing some packets, thus the need for asynchronous mir- 
roring. Traffic shaped asynchronous mirroring enables 
data synchronization between the local and remote 
copies to occur over long periods of time with a con- 
stant network impact. 


[Diagram 2 labels: Packet; FIFO Queue; To network.]

Diagram 2: Traffic shaping.


Even if traffic is shaped at the source, it may 
become jittered as traffic progresses through the net- 
work. To address this jitter, a point-to-point link is 
required or the traffic should be shaped just prior to 
entering the low bandwidth link. 


Traffic shaping allows a maximum bandwidth to
be set, minimizing the impact on the existing
infrastructure and lowering the requirement for the
service level agreement (SLA). In any communication
system using traffic shaping, the finite queue must
remain stable.


Queue stability relies on two parameters, the 
inter-arrival time and the service rate. The service rate 
of the queue must be greater than data inflow; in our 
case, this means setting the maximum data rate
allowed on the network high enough. Secondly, the 
network must be able to successfully transmit the data 
when serviced out of the queue. IP QoS mechanisms 
are used to guarantee the necessary bandwidth avail- 
ability. 
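
The behavior of a fixed-rate shaper and the queue stability condition can be
seen in a small simulation. The Python sketch below is a simplified
discrete-time model, not the router implementation used in the case study: in
each time slot, offered bytes join a finite FIFO queue and at most `rate`
bytes are released. All names and numbers are illustrative.

```python
def shape(arrivals, rate, capacity):
    """Simulate a fixed-rate FIFO shaper.

    arrivals: bytes offered in each time slot
    rate:     bytes released to the network per slot (service rate)
    capacity: maximum bytes the finite queue can hold
    Returns (sent_per_slot, backlog_per_slot, dropped_bytes).
    """
    backlog, dropped = 0, 0
    sent, depth = [], []
    for offered in arrivals:
        backlog += offered
        if backlog > capacity:       # finite queue: the excess is lost
            dropped += backlog - capacity
            backlog = capacity
        out = min(rate, backlog)     # never exceed the configured maximum
        backlog -= out
        sent.append(out)
        depth.append(backlog)
    return sent, depth, dropped


# Bursty input whose average equals the service rate: the queue drains.
print(shape([500, 0, 0, 300, 0, 0, 0, 0], rate=100, capacity=1000))

# Sustained input above the service rate: the queue fills and data is lost.
print(shape([300] * 5, rate=100, capacity=500))
```

In the first run the output is a constant 100 bytes per slot and the backlog
returns to zero; in the second the backlog pins near the queue limit and bytes
are dropped, which is the unstable case the service rate must be set high
enough to avoid.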

Bandwidth availability is greater than required to
perform traffic shaped data synchronization, but the
high network utilization afforded by traffic shaping
prevents overdesign to accommodate peak loads
over the low-bandwidth link. Traffic shaping and other
IP QoS routing mechanisms, specifically in a Cisco
IOS environment, are discussed further in the
Background IP Quality of Service section.


Our final effort to minimize communications 
between the local and remote site is an exploitation of 
file commonality in file updates. Data shadowing 
products were evaluated which allowed block level 
updates as opposed to file level updates. It is expected 
that block level updating will significantly reduce the 
required communications. 
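
The expected saving from block level updates can be illustrated with a generic
comparison of per-block digests. The Python sketch below is not the mechanism
of any evaluated product; it is a hypothetical stand-in that shows why a small
change inside a large file needs only a small transfer. The 4096-byte block
size and the function names are assumptions.

```python
import hashlib


def block_digests(data, block_size):
    """Digest of each fixed-size block of a byte string."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=4096):
    """Indices of blocks in `new` that differ from, or do not exist in, `old`."""
    old_d = block_digests(old, block_size)
    new_d = block_digests(new, block_size)
    return [i for i, d in enumerate(new_d) if i >= len(old_d) or d != old_d[i]]


old = bytes(10 * 4096)        # a 10-block file of zeros
updated = bytearray(old)
updated[5000] = 1             # a one-byte change inside block 1
new = bytes(updated)

dirty = changed_blocks(old, new)
print(dirty)                              # [1]
print(len(dirty) * 4096, "of", len(new))  # bytes to send, block vs. file level
```

Here a file level update would resend all 40,960 bytes, while a block level
update resends a single 4,096-byte block.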


Network Identity Fail-over 


The fail-over of the network identity is driven by
client availability and, for the purpose of DR, is more
properly stated as restoration of client access. If the DR
scenario allows for client survivability, the movement
of network identity must be addressed. If the DR
scenario requires clients to also be replaced, network
identity becomes secondary to the client replacement
process. An example of client replacement is provided
later in this section.


When a fail-over occurs, the IP address and logi- 
cal host name used by the Data Center server need to 
migrate to the DR server. Normally, this is done by 
reconfiguring the public network interfaces on the 
takeover server to use the public IP address. This pro- 
cess is complicated by the mapping of the hardware 
MAC addresses to IP addresses. 


The Address Resolution Protocol (ARP) is used 
to determine the mapping between IP addresses and 
MAC addresses. It is possible, and common on Sun 
platforms, to have all network interfaces on a host 
share the same MAC address. Many system adminis- 
trators tune the ARP cache used by clients to store the 
IP-MAC addresses for anywhere from 30 seconds to 
several hours. When a fail-over occurs, and the IP 
address associated with the service is moved to a host 
with a different MAC address, the clients that have 
cached the IP-MAC address mapping have stale infor- 
mation. There are several ways to address the prob- 
lem: 

• Upon configuration of the DR server’s network
interfaces, a “gratuitous ARP” is sent out
informing other listening network members that
a new IP-MAC address mapping has been
created. Not all machines or operating systems
send gratuitous ARPs, nor do all clients handle
them properly.

• The MAC address can be moved from the data
center server to the DR server. The clients need
to do nothing, as the IP-MAC address mapping
is correct. Switches and hubs that track MAC
addresses for selective forwarding need to
handle the migration; not all equipment does this
well. Binding an Ethernet address to an
interface is shown using the Solaris naming and
configuration syntax:

ifconfig qfe1 ether 8:0:20:1a:2b:33

• Wait for the clients’ ARP cache entries to
expire, so that the clients realize that
the host formerly listening on that MAC
address is no longer available and send new
ARP requests for the public IP address.
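
For readers unfamiliar with the first option, the following Python sketch
builds the bytes of a gratuitous ARP request: an ordinary ARP packet in which
the sender and target IP addresses are the same, sent to the Ethernet
broadcast address. The function names and the example IP address
(`192.0.2.10`, from a documentation range) are assumptions. The sketch only
constructs the frame; transmitting it requires a raw socket and privileges,
which are platform specific and not shown.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert a colon-separated MAC string such as 8:0:20:1a:2b:33 to bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    sha = mac_bytes(mac)
    spa = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    eth_header = broadcast + sha + struct.pack("!H", 0x0806)  # EtherType: ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware and protocol address lengths
        1,            # opcode: request
        sha, spa,     # sender MAC and IP
        b"\x00" * 6,  # target MAC: unknown
        spa,          # target IP equals sender IP, which makes it gratuitous
    )
    return eth_header + arp


frame = gratuitous_arp("8:0:20:1a:2b:33", "192.0.2.10")
print(len(frame), frame.hex())
```

Clients that honor gratuitous ARPs replace their cached IP-MAC entry when they
see such a frame, which is why this option shortens the stale-cache window
without changing the clients.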


Movement of the IP address and logical name 
from the data center server to the DR server is simpler. 
The use of a virtual hostname and IP address is com- 
mon. Most network interface cards support multiple IP 
addresses on each physical network connection, han- 
dling IP packets sent to any configured address for the 
interface. The data center hostname and IP address are 
bound to virtual hostname and IP address by default. 
DR and data center server synchronization can occur 
using the “real” IP address/hostnames. At fail-over, 
the virtual hostname and IP address are migrated to 
the DR server. Clients continue to access data services 
through the virtual hostname or IP address. 


Enabling a virtual IP address is as simple as con- 
figuring the appropriately named device with the logi- 
cal IP address, here shown again using the Solaris 
naming and configuration syntax: 

ifconfig hme0:1 jupiter up

and the “real” addresses are configured one of two
ways: on a data center server named europa

# ifconfig hme0 plumb
# ifconfig hme0 europa up

or on a DR server named io

# ifconfig hme0 plumb
# ifconfig hme0 io up

The virtual IP is associated with the physical hme0
interface.


If a DR server is a consolidation of several data 
center servers, virtual IP addresses can be set up for 
each data center server on the DR server. MAC 
addresses are assigned per interface so installing an 
interface for each consolidated server allows movement 
of the Ethernet addresses. Otherwise, waiting for the 
ARP cache timeout or a gratuitous ARP can be used. 


Client Service Migration 


The final task is to re-establish the clients.
An assortment of clients is in use within the case
study’s production environment (PCs, UNIX workstations
and thin clients), but for DR, Sun Ray thin clients
were chosen. The Sun Ray server software is installed
on the DR server to drive the thin clients. A small
number of thin clients are being set up at the remote
site to allow quick recovery, with the capability to add up
to 50 units if needed. This is far below the production
environment’s normal user load (see Chart 1), but
represents a first step towards a return to normalcy.
Chart 1: Usage load.


The Sun Ray enterprise system is a Sun 
Microsystems’ solution based on an architecture that 
Sun calls the Sun Ray Hot Desk Architecture [7]. The 
Sun Ray enterprise system provides a low cost, desk- 
top appliance that requires no desktop administration, 
is centrally managed, and provides a user experience 
equivalent to that of a Sun workstation if servers and 
networks are properly sized [8]. The Sun Ray appli- 
ance is stateless, with all data and processing located 
on the server. Access is provided to both Solaris and 
Microsoft Windows 2000 TSE through the Citrix ICA 
client from a single desktop [9]. The Windows Citrix 
Servers provide administrative services and are not 
part of the DR site design but will be required to be 
rebuilt from tape on new hardware in the event of a 
disaster. 


Background 


Failures caused by a catastrophic event are 
highly unlikely and difficult to quantify. As a result, 
catastrophic event failures are not normally accounted 
for in most HA calculations even though their rare 
occurrence obviously affects availability. The background
section begins by defining availability and the
levels of availability. DR, HA and the relationship
between the two are then introduced. The
background section concludes with an introduction to 
IP QoS mechanisms provided by network routers, 
with a heavy bias toward QoS features supported in 
Cisco’s IOS. Cisco routers and switches are used in 
the case study production environment. 

Availability 

Availability is the time that a system is capable of 
providing service to its users. Classically, availability is 
defined as uptime / (uptime + downtime) and provided 
as a percentage. High availability systems typically
provide up to 99.999 percent availability, or about
five minutes of downtime a year. The classic definition
does not work well for distributed or client/server
systems. Wood of Tandem Computers [10] presents an
alternative definition where user downtime is used to
make the availability calculation. Specifically,

$$\text{availability} = \frac{\sum_{\text{total users}} \text{user uptime}}{\sum_{\text{total users}} \left(\text{user uptime} + \text{user downtime}\right)}$$

expressed as a percentage.
Wood continues in his 1995 paper to predict 
client/server availability. The causes of failures used in 
Wood’s predictions are summarized in Figure 1. 
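
The two availability definitions in this subsection can be compared
numerically. The Python sketch below implements the classic ratio, converts an
availability figure into minutes of downtime per year, and computes a
user-based figure by summing uptime and downtime over all users. The user
population in the example is invented for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def classic_availability(uptime, downtime):
    """uptime / (uptime + downtime), as a fraction."""
    return uptime / (uptime + downtime)


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


def user_availability(users):
    """users: iterable of (uptime, downtime) pairs, one pair per user."""
    users = list(users)
    up = sum(u for u, _ in users)
    total = sum(u + d for u, d in users)
    return up / total


print(round(downtime_minutes_per_year(0.99999), 2))      # 5.26 minutes
print(round(downtime_minutes_per_year(0.9995) / 60, 1))  # 4.4 hours

# Hypothetical case: the server is up for all 100 hours, but a network
# fault costs 10 of the 100 users 8 hours each.
users = [(100, 0)] * 90 + [(92, 8)] * 10
print(classic_availability(100, 0), user_availability(users))  # 1.0 0.992
```

The last line shows why the classic definition does not work well for
client/server systems: the server reports 100 percent availability while the
user population experienced 99.2 percent.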


[Figure 1 label recovered: Hardware 14%.]

Figure 1: Causes of downtime.


Barr of Faulkner Information Services, relying 
heavily on research from the Sun Microsystems
Corporation, provides a more modern breakdown of the
causes of unplanned downtime reflecting improvements 
in hardware and software reliability [11]. Barr considers 
three factors as contributing to unplanned system 
downtime: Process, People and Product. Barr states that 
Process and People each account for 40 percent of 
unplanned downtime and Product accounts for the 
remaining 20 percent. Barr defines unplanned Product 
downtime to include: hardware, software and environ- 
mental failures. The comparison of Wood’s and Barr’s 
causes of unplanned downtime demonstrates the trend 
for vendor software, hardware and environmental relia- 
bility improvements to improve overall availability 
while driving the causes of unplanned downtime more 
toward their customers’ implementations. 


Availability Cost 


As with any system, there are two ways to 
improve the cost structure of the system: increase pro- 
ductivity and/or decrease expenditures. As implied by 
Wood’s availability definition, computers and com- 
puter systems were created to improve the perfor- 
mance of our work — thus our productivity. The qual- 
ity of the service provided by the system is an end-to- 
end statement of how well the system assisted in 
increasing our productivity. 


The way to increase productivity of the user 
community is to increase the availability of the system 
in a reliable way. HA solutions provide cost benefits 
by pricing out downtime versus the cost of hardware,
software and support. The way to decrease expendi- 
tures is to increase the productivity of the system sup- 
port staff by increasing the system’s reliability and ser- 
viceability. 

Availability Levels 


Increasing levels of availability protect different 
areas of the system and ultimately the business. 
Redundancy and catastrophic recovery are insurance 
policies that offer some availability gains. A project to 
increase availability would expect an availability 
versus investment graph to look similar to the one presented
in Figure 2 (adapted from [12]) and can be
viewed as having four distinct levels of availability: 
no HA mechanisms; data redundancy to protect the 
data; system redundancy and fail-over to protect the 
system; and disaster recovery to protect the organiza- 
tion. As you move up the availability index, the costs 
are cumulative as the graph assumes the HA compo- 
nents are integrated in the system in the order pre- 
sented. 


At the basic system level, no availability 
enhancements have been implemented. The method
of data redundancy used will be some form of backup,
normally to tape. System level failure recovery is
accomplished by restoration from backup. The contin- 
gency planning for disaster recovery is most often 
what has been called the “Truck Access Method” 
(TAM) or a close variant. TAM is briefly discussed in 
the next section.


Figure 2: Availability index.


Data protection provides the largest cost benefit 
of all availability levels. Data protection is important 
for three reasons. First, data is unique. Second, data 
defines the business. Finally, the storage media hous- 
ing data is the most likely component to fail. At the 
data protection level, a redundant array of inexpensive 
disks (RAID) [13] solution as appropriate for the envi- 
ronment is implemented. RAID solutions protect the 
data from loss and provide availability gains during 
the reconstruction of the data. Volume management 
software can be added to automate and enhance many
of the system administration functions associated with
the RAID disk abstraction. 


At the system protection level, redundant HW 
components are added to enhance the ability to 
recover after a failure through system reconfiguration 
and fault tolerance. Automatic system reconfiguration 
(ASR — logically removing the failed components) and 
alternate pathing of networks and disk subsystems 
allow recovery from hardware failures and, coupled
with a locally developed or commercial HA solution,
enable automatic detection and recovery (fail-over)
from software service failures.


Disaster recovery is protection for the organiza- 
tion from a business or program-ending event. Disas- 
ter recovery differs from continuity of operations in 
that continuity of operations is proactive, and common
solutions often rely on a transactionally aware duplicate
of the production data center at a remote site. Disaster
recovery is reactive. DR assumes some loss of availability
and data is acceptable and obtains a cost benefit
for this assumption. The differences between the two 
therefore are cost, reliability and system availability. 


Disaster Recovery 


DR is a complex issue. DR forces an organization 
to prepare for events that no one wants to or really can 
prepare for and to discuss issues that few people are 
comfortable discussing, such as the loss of key person- 
nel. This paper focuses only on the physical problem of 
moving critical functions from a data center to a DR 
site quickly and safely, which is only a portion of disas- 
ter contingency planning. There are several sources [14, 
15] that discuss surviving and prioritizing [16] a com- 
puter disaster from a management perspective. 


Kahane, et al. [17] define several solutions for 
large computer backup centers. The solutions differ by 
response time, cost and reliability. Among these solu- 
tions: 

1. Hot Backup — Maintaining an additional site 
that operates in parallel to the main installation 
and immediately takes over in case of a failure 
of the main center. 

2. Warm Backup — Maintaining another inactive 
site ready to become operational within a mat- 
ter of hours. 

3. Cold Backup — Maintaining empty computer 
premises with the relevant support ready to 
accept the immediate installation of appropriate 
hardware. 

4. Pooling — A pooling arrangement where a few 
members join into a mutual agreement concern- 
ing a computer center, which is standing-by idle 
to offer a service to any member suffering from 
interruption in its computer center. The idle 
computer center may be employed in the form 
of “cold” or “warm” backup.


The focus of this project is the creation of a 
“warm backup” site that integrates an easily testable 
DR capability into Data Center operations. 


2002 LISA XVI — November 3-8, 2002 — Philadelphia, PA 


Geographically Distributed System for Catastrophic Recovery 


Disaster Recovery Approaches 


There are generally two common extremes to 
continuity of operations planning: restoration from tape
and a fully replicated synchronous backup facility. 


The simplest method of DR preparation has been 
called the “Truck Access Method” (TAM) or a close
variant. For TAM, periodic backups of all systems and 
data are made and stored at a safe and secure off-site 
facility. The advantage of this method is cost, but there 
are three main disadvantages affecting reliability and 
availability. First, the number of tapes can be quite 
large. A single bad or mislabeled tape can hamper DR 
extensively. Second, procurement and installation of 
infrastructure components is time-consuming and any- 
thing short of replication of the production data center 
greatly complicates restoration from tape. Full restora- 
tion from tape is also very time consuming. Lastly, test- 
ing the DR procedures is complicated and can result in 
downtime. Extensions to the TAM method include the 
use of a commercial DR facility or backup pool [17] or 
the construction of a redundant data center. 


At the other extreme, a remote backup facility can 
be constructed for DR where order-preserving transac- 
tions are used to keep the primary and the backup data 
synchronous [18, 19]. This approach often involves the 
addition of front-end processors and a mainframe at the 
remote site. Additionally, the communication link 
between the data center and remote site will be expen- 
sive and potentially can degrade performance of the host 
applications. In designing a synchronous backup facility, 
replication is the driving consideration. Wiesmann, et al. 
[20] provide an overview of replication in distributed
systems and databases along with providing a functional 
model that can be used in designing a specific replica- 
tion solution. 


A common compromise is to employ data vault- 
ing [21] where commercially available data replication 
solutions mirror or shadow the data to an alternate loca- 
tion, reducing recovery time but mostly reducing risks. 


HA Technologies 


Replicated hardware solutions have traditionally 
been used to provide fault tolerance [22]. Fault toler- 
ance refers to design techniques such as error correc- 
tion, majority-voting, and triple modular redundancy 
(TMR), which are used to hide module failures from 
other parts of the system. Pfister [23] and Laprie, et al., 
[24] can provide the reader more background on fault 
tolerant systems. Fault Tolerant systems can be classi- 
fied as primarily hardware or primarily software. Hard- 
ware fault tolerant systems tend to be more robust with 
quicker recovery time from faults but tend to be more 
expensive. Software systems create very fast recovery 
by providing a method of migrating a process and its 
state from the failed node to a fail-over node. Milojicic, 
et al. [25] provide a survey of the current state of process
migration. Fault-tolerant systems are synchronous
and the low latency requirements between systems 
make their use for DR impractical.


HA systems differ from fault tolerant
systems in that they recover from faults rather than correct
them. A latent bug in application code which causes a
fault is unlikely to re-occur after a fail-over in a HA 
solution as the application will be re-initialized. This 
distinction makes HA solutions the preferred solution 
for software failures and most operational failures. 
Fault tolerant systems are generally used in mission 
critical computing where faults must be masked and 
are often components in HA systems. 


The basic model for building a HA system is 
known as the primary-backup [26] model. In this 
model, for each HA service one of the servers is desig- 
nated as the primary and a set of the others are desig- 
nated as backups. Clients make service requests by 
sending messages only to the primary service 
provider. If the primary fails, then a fail-over occurs 
and one of the backups takes over offering the service.
The virtues of this approach are its simplicity and its 
efficient use of computing resources. Servers providing
backup services may be “hot-standbys” or themselves
providers of one or more primary services. The
primary-backup model of HA system provides a good 
model for the creation of a “warm backup” site. 
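
The primary-backup model can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not the implementation
used in this project: the Server and PrimaryBackupService classes are
hypothetical, and a failed request stands in for the heartbeat-based
failure detection a real HA product would use.

```python
class Server:
    """A hypothetical service provider that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class PrimaryBackupService:
    """Clients talk only to the current primary; backups wait to take over."""

    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)

    def fail_over(self):
        """Promote the first live backup to primary."""
        while self.backups:
            candidate = self.backups.pop(0)
            if candidate.alive:
                self.primary = candidate
                return candidate
        raise RuntimeError("no live backup available")

    def request(self, payload):
        """Send the request to the primary, failing over once if it is down."""
        try:
            return self.primary.handle(payload)
        except ConnectionError:
            self.fail_over()
            return self.primary.handle(payload)
```

Because clients address the service rather than a particular server, a
fail-over changes which node answers without changing how clients ask.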


In building an HA system, ensuring data 
integrity in persistent storage during the fail-over process
is the most important criterion, even more important
than availability itself. In general, there are two
methods for providing a shared storage device in a HA
cluster: direct attached storage or some form of net- 
work storage. 


Direct attached storage is the most straightfor- 
ward persistent storage method using dual-ported disk. 
The issue is one of scalability. As an HA cluster 
requires disk connection to all primary and fail-over 
nodes, clusters of greater than four to eight nodes can- 
not support direct attached persistent storage and 
require disk systems be deployed in some form of a 
storage network. 


Storage Area Network (SAN) environments are 
dedicated networks that connect servers with storage 
devices such as RAID arrays, tape libraries and Fiber 
Channel host bus adapters, hubs and switches. Fiber 
Channel SANs are the most common SANs in use 
today [27]. Fiber Channel SANs offer gigabit perfor- 
mance, data mirroring and the flexibility to support up 
to 126 devices on a single Fiber Channel-Arbitrated 
Loop (FC-AL) [28]. SANs are a maturing technology;
as such, there are numerous developing standards and
alliances for Fiber Channel SAN design. Furthermore, 
distance limits for a SAN are the same as the underly- 
ing technologies upon which it is built. An emerging 
standard, Fiber-Channel over TCP/IP (FCIP) [29] has 
been proposed to the Internet Engineering Task Force 
(IETF). FCIP offers the potential to remove the dis- 
tance limitations on SANs. 


An alternative to SAN environments is NAS 
(Network Attached Storage) networks. NAS networks
also connect servers with storage devices but they do
so over TCP/IP via existing standard protocols such as 
CIFS (Common Internet File System) [30] or NFS 
(Network File System) [31]. 


The use of a storage network to provide data 
access during fail-over is sufficient for HA but does 
not provide for the separation of resources necessary 
in DR. DR needs a copy of the data at a remote loca- 
tion. The addition of a data mirroring capability is a 
potential solution. The mirroring process of the persis- 
tent storage can be synchronous, asynchronous or 
logged and resynchronized. Local mirroring, also 
known as RAID 1, consists of two disks that syn- 
chronously duplicate each other’s data and are treated 
as one drive. One solution to providing a remote copy 
of the data would be to stretch the channel over which 
the data is mirrored. A single FC-AL disk subsystem 
can be up to 10 kilometers [28] from the host system. 
For some disaster contingency planning, ten kilometers 
may be sufficient. Channel extenders offer potential 
distances greater than ten kilometers [21]. 


An alternative to synchronous mirroring is asyn- 
chronous mirroring (also known as shadowing). In 
asynchronous mirroring, updates to the primary disk 
and mirror are not atomic, thus the primary and mirror 
disk are in different states at any given point in time. 
The advantage of asynchronous mirroring is a reduc- 
tion in the required bandwidth as real-time updates are 
not required and a failure in the mirror does not affect 
the primary disk. The two basic approaches to asynchronous
mirroring are: to take a “snapshot” of the
primary data and use a copy-on-write [32] mechanism 
for updating both the mirror and the primary data; or 
to log updates [33] to the primary data over a period of 
time then transfer and apply the log to the mirror. The 
result of asynchronous mirroring is that the data and 
the mirror are synchronized at a point in the past. 
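
The log-based approach to asynchronous mirroring can be illustrated as
follows. This is a simplified sketch, assuming in-memory dictionaries in
place of disks and a hypothetical LoggedMirror class; it is not how any
particular commercial product is implemented.

```python
class LoggedMirror:
    """Primary store whose updates are logged and applied to a mirror later."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.log = []  # ordered (block_id, data) updates not yet shipped

    def write(self, block_id, data):
        """Update the primary immediately; the mirror is not touched."""
        self.primary[block_id] = data
        self.log.append((block_id, data))

    def synchronize(self):
        """Transfer the log and replay it, in order, on the mirror."""
        shipped = len(self.log)
        for block_id, data in self.log:
            self.mirror[block_id] = data
        self.log.clear()
        return shipped
```

Between calls to synchronize() the mirror reflects the primary as it was
at the last synchronization, which is the point-in-the-past property
described above.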


Commercial HA Solutions 


There have been small investigations [35] into 
the formal aspects of the primary-backup system but 
the traditional two to a few node HA systems have 
been widely used commercially. Many commercial 
cluster platforms support fail-over, migration, and 
automated restart of failed components, notably Com- 
paq, HP, IBM, and Tandem [23], Sun’s Full Moon [36] 
and Microsoft’s Cluster Service [37]. All of the com- 
mercial cluster platforms mentioned offer only single- 
vendor proprietary solutions. Veritas Cluster [38] 
offers a multi-vendor HA solution. 


Commercial products like BigIP [39] from 
F5 Networks or TurboLinux’s TurboCluster [40] have
been introduced where clustering is used for load bal- 
ancing across a cluster of nodes to provide system 
scalability with the added benefit of high availability. 
These systems are employing novel approaches to 
load balancing across various network components 
and IP protocols. The use of clustering across geographically
distributed areas is gaining support for
building highly available Web Servers [42, 43, 44].
The geographic distribution of the Web Servers
enhances high availability of the Web interface and
provides for virtually transparent disaster recovery of
the Web Server. The DR issue with the design is with
the back-end processor. In the Iyengar, et al. [43] article,
the back-end processor was an IBM large-scale
server in Nagano, Japan, without which the Web
Servers provide only static, potentially out-of-date data.


High Availability and Disaster Recovery 


On the surface, DR would seem to be an exten- 
sion of HA. HA’s goal is not only to minimize failures 
but also to minimize the time for recovery from them. 
In order to asymptotically approach 100% availability, 
fail-over services are created. One DR strategy would 
be to create a fail-over node at an appropriate off-site 
location by extending the communications between 
the clustered systems over geographic distances.


However, DR and HA address very different 
problems. Marcus and Stern [12] made four distinc- 
tions. First, HA servers are colocated due to disk cable 
length restrictions and network latency; DR servers are 
far apart. Second, HA disk and subnets are shared; DR 
requires servers with separate resources. Third, HA 
clients see a fail-over as a reboot; DR clients may be 
affected also. Finally, HA provides for simple if not 
automatic recovery; DR will involve a complex return 
to normalcy. 


Furthermore, commercial HA solutions assume 
adequate bandwidth, often requiring dedicated redun- 
dant 10 or 100 Megabit channels for a “heartbeat.” 
Data center performance requirements often require 
Gigabit channels for disk access. Even if network 
latency and disk cable length restrictions can be over- 
come with channel extension technologies [21] and 
bandwidth, the recurring communications cost associ- 
ated with providing the required bandwidth to support 
HA clusters over geographic distances is currently 
prohibitive for most organizations. 


The level to which HA technologies can be cost 
effectively leveraged in a DR solution offers some sim- 
plification and risk reduction of the DR process. In order 
to cost effectively use HA technologies in DR, the high 
bandwidth communication channels must be replaced 
with low bandwidth usage. Our focus is to minimize 
required communications between the primary and the 
backup, efficiently utilize the available bandwidth and 
rely on IP QoS mechanisms to ensure a stable operational
communications bandwidth. The next subsection pro- 
vides an overview of IP QoS mechanisms. 


IP Quality of Service 


In order to provide end-to-end QoS, QoS features 
must be configured throughout the network. Specifi- 
cally, QoS must be configured within a single network 
element, which includes queuing, scheduling, and traffic
shaping. QoS signaling techniques must be configured 
for coordinating QoS from end-to-end between network 
elements. Finally, QoS policies must be developed and
configured to support policing and the management
functions necessary for the control and administration of 
the end-to-end traffic across the network. 


Not all QoS techniques are appropriate for all 
network routers as edge and core routers perform very 
different functions. Furthermore, the QoS tasks spe- 
cific routers are performing may also differ. In gen- 
eral, edge routers perform packet classification and 
admission control while core routers perform conges- 
tion management and congestion avoidance. The fol- 
lowing QoS overview of router support is biased 
toward what is available as part of Cisco’s IOS as 
Cisco routers are used in the case study environment. 
Bhatti and Crowcroft provide a more general overview 
of IP QoS [45]. 

Three levels of end-to-end QoS are generally 
defined by router vendors: Best Effort, Differentiated
and Guaranteed Service [46]. Best Effort Service
(a.k.a. lack of QoS) is the default service and is the
current standard for the Internet. Differentiated Ser- 
vice (a.k.a. Soft QoS) provides definitions that are 
appropriate for aggregated flows at any level of aggre- 
gation. Examples of technologies that can provide dif- 
ferentiated service (DiffServ) in an IP environment are 
Weighted Fair Queuing (WFQ) with IP Precedence 
signaling or, under IOS, Priority Queuing (PQ) when 
only a single link is required [47]. Guaranteed Service 
(a.k.a. hard QoS) is the final QoS level as defined. 
Guaranteed Service provides a mechanism for an 
absolute reservation of network resources. Integrated 
Services (IntServ) guaranteed service could be config- 
ured using hard QoS mechanisms, for example, WFQ 
combined with Resource Reservation Protocol 
(RSVP) [48] signaling or Custom Queuing (CQ) on a
single link [49] in a Cisco IOS environment. 


In a router environment, end-to-end QoS levels 
are implemented using features provided as part of the 
router’s operating system. These features typically fall 
into five basic categories: packet classification and 
marking, congestion management, congestion avoid- 
ance, traffic conditioning and signaling. In a Cisco 
router environment, the Internetwork Operating
System (IOS) provides these QoS building blocks via
what Cisco refers to as the “QoS Toolkit” [50].


QoS policies are implemented on an interface in 
a specific sequence [51]. First, the packet is classified. 
This is often referred to as coloring the packet. The 
packet is then queued and scheduled while being sub- 
ject to congestion management techniques. Finally, the 
packet is transmitted. Packet classification is discussed 
next, followed by a discussion of queuing and 
scheduling. Congestion avoidance techniques are used 
to monitor the network traffic loads in an effort to 
identify the initial states of congestion and proactively 
avoid it. Congestion avoidance techniques are not 
used in this project and will not be discussed further. 
The IP Quality of Service section proceeds with a dis- 
cussion of traffic shaping and policing; concluding 
with a discussion of RSVP signaling.


In order to implement any QoS strategy using the
QoS toolkit, the router and version of IOS must support 
the features used. Cisco provides a matrix of IOS ver- 
sions, routers and QoS features for cross-reference [52]. 


Packet Classification and Marking 


Packet classification occurs by marking packets 
using either IP Precedence or the DiffServ Code Point 
(DSCP) [46]. IP Precedence utilizes the three prece- 
dence bits in the IP version 4 header’s Type of Service 
(ToS) field to specify class of service for each packet. 
Six classes of service may be specified. The remaining 
two classes are reserved. The DSCP replaces the ToS 
in IP version 6 and can be used to specify one of 64 
classes for a packet. In a Cisco router, IP Precedence 
and DSCP packet marking can be performed explicitly 
through IOS commands or IOS features such as pol- 
icy-based routing (PBR) and committed access rate 
(CAR) can be used for packet classification [53]. 
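
The relationship between the two markings can be shown with simple bit
arithmetic on the eight-bit ToS value: IP Precedence occupies the three
high-order bits and the DSCP occupies the six high-order bits of the
same octet. The function names below are our own and are offered only as
an illustration.

```python
def ip_precedence(tos):
    """Return the 3-bit IP Precedence (0-7) from an 8-bit ToS value."""
    return (tos & 0xFF) >> 5


def dscp(tos):
    """Return the 6-bit DSCP (0-63) from the same 8-bit field."""
    return (tos & 0xFF) >> 2


def set_precedence(tos, precedence):
    """Return tos with its precedence bits replaced and the low five bits kept."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must be 0..7")
    return (precedence << 5) | (tos & 0x1F)
```

Because the precedence bits are the top three bits of the DSCP, a router
that only understands IP Precedence still sees a usable class in a
DSCP-marked packet.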


PBR [54] is implemented by the QoS Policy 
Manager (QPM) [51]. PBR allows for the classifica- 
tion of traffic based on access control list (ACL). 
ACLs [55] establish the match criteria and define how 
packets are to be classified. ACLs classify packets 
based on port number, source and destination address 
(e.g., all traffic between two sites) or MAC address.
PBR also provides a mechanism for setting the IP
Precedence or DSCP, providing a network the ability
to differentiate classes of service. Finally, PBR provides
a mechanism for routing packets through traffic-engineered
paths. The Border Gateway Protocol
(BGP) [56] is used to propagate policy information 
between routers. Policy propagation allows packet 
classification based on ACLs or router table source or 
destination address entry use with IP Precedence. 


CAR [57] implements classification functions. 
CAR’s classification service can be used to set the IP 
Precedence for packets entering the network. CAR 
provides the ability to classify and reclassify packets 
based on physical port, source or destination IP or 
MAC address, application port, or IP protocol type, as 
specified in the ACL. 


Congestion Management 


Congestion Management features are used to 
control congestion by queuing packets and scheduling 
their order of transmittal using priorities assigned to 
those packets under various schemes. Giroux and 
Ganti provide an overview of many of the classic 
approaches [58]. Cisco’s IOS implements four queu- 
ing and scheduling schemes: first in first out (FIFO), 
weighted fair queuing (WFQ), custom queuing (CQ) 
and priority queuing (PQ). Each is described in the 
following subsections [53]. 


First In First Out (FIFO) 


In FIFO, there is only one queue and all packets 
are treated equally and serviced in a first in first out 
fashion. FIFO is the default queuing mechanism for
interfaces above E1 (2.048 Mb/s) on Cisco routers and is the fastest
of Cisco’s queuing and scheduling schemes.


Weighted Fair Queuing

WFQ provides flow-based classification to queues 
via source and destination address, protocol or port. 
The order of packet transmittal from a fair queue is 
determined by the virtual time of the delivery of the last 
bit of each arriving packet. Cisco’s IOS implementation 
of WFQ allows the definition of up to 256 queues. 


In IOS, if RSVP is used to establish the QoS, 
WFQ will allocate buffer space and schedule packets 
to guarantee bandwidth to meet RSVP reservations. 
RSVP, a signaling protocol discussed later in this
section, uses the largest amount of data the router
will keep in queue and the minimum QoS to determine
the bandwidth reservation.


If RSVP is not used, WFQ, like CQ (see Custom 
Queuing), transmits a certain number of bytes from 
each queue. For each cycle through all the queues, 
WFQ effectively transmits a number of bytes equal to 
the precedence of the flow plus one. If no IP Prece- 
dence is set, all queues operate at the default prece- 
dence of zero (lowest) and the scheduler transmits 
packets (bytewise) equally from all queues. The router 
automatically calculates these weights. The weights 
can be explicitly defined through IOS commands. 
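
The precedence-plus-one rule translates directly into bandwidth shares.
The sketch below is a simplified model of that rule only, with a
hypothetical function name; it does not reproduce the IOS scheduler.

```python
def wfq_shares(precedences):
    """Map each flow's IP Precedence (0-7) to its fraction of link bandwidth.

    Each flow is weighted by precedence + 1, as described above.
    """
    weights = [p + 1 for p in precedences]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, three flows at precedence 0, 1 and 5 have weights 1, 2 and
6, so they receive 1/9, 2/9 and 6/9 of the bandwidth respectively, while
three flows at the default precedence of zero share the link equally.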


Priority Queuing

In Cisco’s IOS, PQ provides four queues with 
assigned priority: high, medium, normal, and low. Packets
are classified into queues based on protocol, incoming
interface, packet size, or ACL criteria. Scheduling is
determined by absolute priority. All packets queued in a 
higher priority queue are transmitted before a lower pri- 
ority queue is serviced. Normal priority is the default if 
no priority is set when packets are classified. 
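
Absolute-priority scheduling can be sketched as follows. The class and
method names are hypothetical and the model ignores queue length limits;
it is meant only to show why low-priority traffic can be starved.

```python
from collections import deque

PRIORITIES = ("high", "medium", "normal", "low")


class PriorityQueuing:
    """Four queues served in strict priority order."""

    def __init__(self):
        self.queues = {name: deque() for name in PRIORITIES}

    def enqueue(self, packet, priority="normal"):
        """Queue a packet; unclassified packets default to normal priority."""
        self.queues[priority].append(packet)

    def dequeue(self):
        """Return the next packet from the highest non-empty queue, or None."""
        for name in PRIORITIES:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None
```

A lower queue is serviced only when every higher queue is empty, so a
steady stream of high-priority packets delays all other traffic
indefinitely.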


Custom Queuing 


In Cisco’s IOS, CQ is a queuing mechanism that 
provides a lower bound guarantee on bandwidth allo- 
cated to a queue. Up to 16 custom queues can be spec- 
ified. Classification of packets destined for a queue is 
by interface or by protocol. CQ scheduling is weighted 
round robin. The weights are assigned as the minimum 
byte count to be transmitted from a queue in a given 
round robin cycle. When a queue is transmitting, the 
count of bytes transmitted is kept. Once a queue has 
transmitted its allocated number of bytes, the currently 
transmitting packet is completed and the next queue in 
sequence is serviced. 
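
One round-robin cycle of this scheme can be modeled as shown below. The
function is a hypothetical sketch of the byte-count rule described
above, with packets represented only by their sizes in bytes.

```python
from collections import deque


def custom_queuing_cycle(queues, byte_counts):
    """Run one round-robin cycle and return the packet sizes sent, in order.

    queues      -- list of deques holding packet sizes in bytes
    byte_counts -- configured byte count for each queue
    """
    sent = []
    for queue, allowance in zip(queues, byte_counts):
        transmitted = 0
        # Keep sending until the allowance is met; the packet that crosses
        # the threshold is still sent whole.
        while queue and transmitted < allowance:
            size = queue.popleft()
            transmitted += size
            sent.append(size)
    return sent
```

Because the last packet is always completed, a queue can exceed its
configured byte count in a given cycle, so byte counts are best chosen
with typical packet sizes in mind.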


Traffic Policing and Shaping 


Policing is a non-intrusive mechanism used by 
the router to ensure that the incoming traffic is con- 
forming to the service level agreement (SLA). Traffic 
Shaping modifies the traffic characteristics to conform 
to the contracted SLA. Traffic shaping is fundamental 
for efficient use of network resources as it prevents the 
drastic actions the network can take on non-conforming
traffic, which leads to retransmissions and
therefore inefficient use of network resources. The
traffic shaping function implements either single or 
dual leaky bucket or virtual scheduling [59, 60, 61]. 
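
The effect of a single leaky bucket on a burst of traffic can be
sketched as below. This is a simplified model, assuming an unbounded
bucket (a real shaper bounds the queue and drops or marks the overflow);
the function name and the packet representation are our own.

```python
def shape(packets, rate_bps):
    """Return departure times for (arrival_time_s, size_bits) packets.

    Packets drain from the bucket at a constant rate_bps, so a burst is
    delayed and spread out rather than dropped. Packets must be given in
    arrival order.
    """
    departures = []
    link_free_at = 0.0
    for arrival, size_bits in packets:
        # A packet starts draining when it has arrived and the previous
        # packet has finished leaving the bucket.
        start = max(arrival, link_free_at)
        link_free_at = start + size_bits / rate_bps
        departures.append(link_free_at)
    return departures
```

Two one-megabit packets arriving together at a five megabit/second
bucket leave 0.2 seconds apart, which is how shaping turns a burst into
a steady flow.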


Signaling Mechanisms 


End-to-end QoS requires that every element in the 
network path deliver its part of QoS, and all of these 
entities must be coordinated using QoS signaling. The 
IETF developed RSVP as a QoS signaling mechanism. 


RSVP is the first significant industry-standard 
protocol for dynamically setting up end-to-end QoS 
across a heterogeneous network. RSVP, which runs 
over IP, allows an application to dynamically reserve 
network bandwidth by requesting a certain level of 
QoS for a data flow across a network. The Cisco IOS 
QoS implementation allows RSVP to be initiated 
within the network using configured proxy RSVP. 
RSVP requests the particular QoS, but it is up to the 
particular interface queuing mechanism, such as 
WFQ, to implement the reservation. If the required 
resources are available and the user is granted admin- 
istrative access, the RSVP daemon sets arguments in 
the packet classifier and packet scheduler to obtain the 
desired QoS. The classifier determines the QoS class 
for each packet and the scheduler orders packet trans- 
mission to achieve the promised QoS for each stream. 
If either resource is unavailable or the user is denied 
administrative permission, the RSVP program returns 
an error notification to the application process that 
originated the request [62]. 


Study Results 


The case study was intended to evaluate the feasibility
and production implementation options for the
proposed DR solution. Our design sought to minimize
communications requirements through data shadowing,
exploitation of file commonality in file updates and
network traffic shaping, and to ensure system stability
through IP QoS. Our prototype sought to measure the
impact of each of the communication limiting tech- 
niques. The measurements were carried out in three 
distinct evaluations. 

• The first evaluation was to determine the effect
of block level updates versus file updates.

• The second evaluation was to determine the
level of network bandwidth efficiency reasonably
achievable.

• The third and final evaluation was establishing
a configuration that supports the required QoS.


The test environment is presented next, followed
by the evaluations carried out and the issues they 
revealed. This section concludes with the results of the 
evaluations. 


The Test Environment 


A test environment was configured as shown in 
Diagram 3 and was constructed in as simple a manner 
as possible to reduce the cost of the evaluation. The 
entire test environment took about five days to install 
and configure, given that the infrastructure was 
already in place. Operating Systems and applications 
are loaded on the DR servers and updates are made 
manually. The baseline testing took about 90 days to 
gather the data. In the test environment, the OS was 
Solaris 7 and the applications were Oracle, PVCS and 
local configuration management database applications. 
This was the most labor-intensive part of the test set-up.
One of the production servers was used to maintain the
heartbeat between the test systems and to initiate and
monitor the asynchronous mirror updates.
An available Network Appliance filer was set up to
store the backup data. The primary data used in the
test was restricted to NAS accessed data. This allowed
the use of only one commercial data-shadowing product,
reducing the cost and complexity of the test. The
commercial data shadowing product chosen was SnapMirror
[63] from Network Appliance. The filer and
SnapMirror software were configured in approximately
a day.


Diagram 3: Test environment.


SnapMirror uses the snapshot [64] facility of the 
WAFL file system [65] to replicate a volume on a 
partner file server. SnapMirror is used in this study to 
identify changes in one of the production data vol- 
umes and resynchronize the remote volume with the 
active. After the mirror synchronization is complete, 
the backup data is at the state of the primary data at 
the instant of the snapshot. The remote mirrors are 
read-only volumes and are not accessible except by 
the source file servers. When a fail-over is to occur, 
the backup volumes are manually switched from a 
read-only standby state to read-write active state and 
rebooted. After the reboot, the backup filer is accessi- 
ble by the DR server and the remote data can be 
mounted. SnapMirror and Network Appliance filers 
were chosen for this test based on their current use in 
the production environment, the availability of a filer 
to use in the test, and their support of block level 
updates allowing a determination of the impact of a 
block level versus file level update policy. The amount
of data used in the test was constrained by the avail- 
able disk storage on the filer, 120 GB. 


The production application servers, the produc- 
tion NAS filers, the DR application server and the DR 
filer were connected to the shared production Gigabit 
Ethernet LAN. The production and DR filers along 
with the production application servers also have a 
second interface which is attached to a maintenance 
LAN running switched fast Ethernet. The asyn- 
chronous mirroring was tested on the production LAN 
to look for impacts and then reconfigured to run over 
the maintenance LAN. The heartbeat and synchroniza- 
tion initiation was carried out over the production 
LAN. 


Evaluations 


The first evaluation was to determine the effect 
of block level updates versus file updates. This test
consisted of mirroring asynchronously approximately 
100 GB of production data for 52 days and measuring 
the volume of block level updates required for the syn- 
chronization. The mirror synchronization was initiated 
daily at 3 pm, just after the peak load (see Chart 1). 
Chart 2 shows the weekday daily change rate as a per- 
centage of total data monitored. The mean percentage 
daily rate of change was 2.32% with the minimum 
daily rate of change being 1.12% and a maximum 
daily change rate of 3.69%.


The sizes of the files that were modified over
the update period were summed. This test was
accomplished by running a Perl script over the
snapshot data used by SnapMirror. Block level updates
show a reduction of approximately 50% in the data
required for transfer versus uncompressed copying of
the modified files.


The second evaluation was to determine the level 
of network bandwidth efficiency reasonably achievable. 
The mirrored data was traffic shaped at the data source 
using a leaky bucket algorithm provided with the Snap- 
Mirror product. The data shadowing traffic was mea- 
sured at the interface of the DR filer during the syn- 
chronization process. A threshold value (the hole in the 
bucket) of five megabits/second was set creating the 
virtual low bandwidth connection depicted in Diagram 
3 within the LAN’s Ethernet channel.


Chart 2: Daily data change rates.

Chart 3: Weekday network throughput.


The asynchronous mirroring occurred over the 
production LAN where there is significant excess 
bandwidth. No other IP QoS mechanisms were used at 
this point in order to see if a constant load could be 
achieved and what impact the addition of this constant 
load would place on the Data Center filers and on the 
LAN. The rate of five megabits/second was selected
as it was expected that a low bandwidth channel of
less than five megabits/second would be required to
update the production DR site given the current one
terabyte requirement. The mean effective throughput
for the test was 3.99 megabits/second.


When looking at the data, it became obvious the 
overhead of determining the updates to the volume 
and read/write times were significant during periods of 
low data volume change which occur every weekend. 
In addition, the larger the data transferred in the 
update, the higher the network throughput. Removing 
data from each Saturday and Sunday yields an 
increase in the mean throughput to 4.26 megabits/sec- 
ond, which gives a bandwidth efficiency of 85%. 
Chart 3 graphs the weekday throughput of the test. As 
noted in Chart 3, the throughput is consistent, implying
efficient network utilization. Throughput and peak 
network loads were measured before, during and after 
the synchronization on weekdays. Throughput, as 
expected, was increased proportionally to the network 
traffic added. Peak network loads were unaffected. 
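

The efficiency figures above are the measured mean throughput divided by the shaping threshold. A minimal sketch of that arithmetic follows, using only the figures reported in this section; the function and constant names are ours, chosen for illustration.

```python
def bandwidth_efficiency(mean_mbps, shaped_mbps):
    """Fraction of the shaped channel actually carrying mirror data."""
    return mean_mbps / shaped_mbps


THRESHOLD_MBPS = 5.0  # leaky bucket threshold used in the test

all_days = bandwidth_efficiency(3.99, THRESHOLD_MBPS)  # weekends included
weekdays = bandwidth_efficiency(4.26, THRESHOLD_MBPS)  # weekends removed

print(f"all days: {all_days:.1%}")
print(f"weekdays: {weekdays:.1%}")
```

The weekday value rounds to the 85% quoted above; including the weekends lowers it to roughly 80%.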


The second evaluation demonstrated two key 
capabilities. First, the data could be successfully syn- 
chronized between the primary and a remote DR site 
over a low bandwidth channel, in this case five Mbps. 
Secondly, the data required for this synchronization 
could be throttled to efficiently use a low bandwidth 
channel or provide a minimal impact on a shared 
higher bandwidth channel. 
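

The throttling described above can be pictured with a small leaky bucket model. The sketch below is our own illustration and not SnapMirror's implementation: data offered in bursts is queued, and at most a fixed amount drains per time step. The function name and the notion of a fixed "tick" are assumptions made for the example.

```python
def leaky_bucket(arrivals_mbit, rate_mbit_per_tick):
    """Shape a bursty arrival sequence to a constant maximum output rate.

    arrivals_mbit: megabits offered to the shaper in each tick.
    Returns the megabits sent in each tick, continuing past the last
    arrival until the backlog has drained.
    """
    rate = float(rate_mbit_per_tick)
    if rate <= 0:
        raise ValueError("rate must be positive")
    backlog = 0.0
    sent = []
    pending = list(arrivals_mbit)
    while pending or backlog > 0:
        if pending:
            backlog += pending.pop(0)
        out = min(backlog, rate)   # never exceed the hole in the bucket
        backlog -= out
        sent.append(out)
    return sent


# A 20 megabit burst followed by silence, shaped to 5 megabits per tick.
shaped = leaky_bucket([20, 0, 0], 5)
print(shaped)
```

The burst is spread over four ticks at the threshold rate, which is the constant load the evaluation was looking for.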


The third and final evaluation was establishing a 
configuration that supports the required QoS. The 
issue is to ensure that shaped data transmitted from the 
source filer is transmitted to the DR filer without addi- 
tional delays. If the network cannot support the data 
rate of the source filer, the network acts as an addi- 
tional queue and introduces delay. This delay intro- 
duces jitter into the shaped data that may prevent the 
synchronization of the data within the fidelity of the 
DR requirements. For example, if the DR requirement
is to synchronize data hourly, the additional delay may
cause the synchronization to take more than one hour.
What is the result? Does the current synchronization
fail when the next synchronization is initiated,
which could result in never successfully
synchronizing the data? The proposed solution is
twofold. First, initiation of a data synchronization does
not occur until the completion of the previous data 
synchronization. If a synchronization has to wait, it is 
initiated as soon as the previous synchronization com- 
pletes. This check was added to the heartbeat, but was 
also discovered to be a feature of the SnapMirror 
product. 
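

The first half of the solution, never starting a synchronization before the previous one completes and starting a delayed one as soon as it can, can be sketched as follows. This is an illustrative model only: the function name and the use of minutes as the time unit are our assumptions, and it does not reflect how SnapMirror or the heartbeat implements the check.

```python
def schedule_syncs(interval, durations):
    """Return (start, end) times for successive synchronizations.

    Synchronization i is due at i * interval. It starts when due, or as
    soon as the previous synchronization has finished, whichever is later.
    """
    runs = []
    previous_end = 0
    for i, duration in enumerate(durations):
        due = i * interval
        start = max(due, previous_end)
        end = start + duration
        runs.append((start, end))
        previous_end = end
    return runs


# Hourly policy (60 minutes); the second run is slowed by network delay.
runs = schedule_syncs(60, [45, 80, 30, 30])
print(runs)
```

In this example the third run is held until the slow second run ends at minute 140, and the fourth run is back on its hourly schedule.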


Secondly, to prevent the additional delays in the 
network, IP QoS mechanisms can be used to provide a 
guarantee of adequate bandwidth based on the traffic 
shaping threshold. As previously described, many 
configuration options could be used to meet the QoS 
requirements. In the case study, the SnapMirror prod- 
uct was reconfigured to asynchronously mirror over 
the maintenance network. Custom queuing was
enabled on the interface to the source filer and
configured to guarantee 5% of the 100 Mbps link to the
mirror process. The maintenance network is primarily
used for daily backup to tape, thus its traffic is bursty 
and heavily loaded during off-peak hours (8 pm-6 
am). Network traffic continues to be measured at the 
interface of the DR filer and has continued to operate 
around the 4.26 Mbps level. An excerpt from the IOS 
configuration used in the test follows: 

    interface serial 0
     custom-queue-list 3
    queue-list 3 queue 1 byte-count 5000
    queue-list 3 protocol ip 1 tcp 10566
    queue-list 3 queue 2 byte-count 95000
    queue-list 3 default 2

Queues are cycled through sequentially in a round-robin
fashion, dequeuing the configured byte count
from each queue. In the above excerpt, SnapMirror
traffic (TCP port 10566) is assigned to queue one. All other
traffic is assigned to queue two. Entire packets are
transmitted from queue one until the queue is empty or
5000 bytes have been transmitted. Then queue two is
serviced until its queue is empty or 95000 bytes have
been transmitted. This configuration provides a minimum
of 5% of the 100 Mbps link to the SnapMirror traffic.
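

The round-robin servicing just described can be modeled in a few lines. The sketch below is a simplified illustration of byte-count queuing, not a model of the router's actual scheduler; the function name and the 1000-byte packet size are assumptions chosen to keep the arithmetic exact.

```python
from collections import deque


def service_round(queues, byte_counts):
    """Visit each queue once, in order; return the bytes sent per queue."""
    sent = []
    for packets, limit in zip(queues, byte_counts):
        total = 0
        # Whole packets are sent until the queue empties or the
        # configured byte count has been reached.
        while packets and total < limit:
            total += packets.popleft()
        sent.append(total)
    return sent


BYTE_COUNTS = [5000, 95000]  # queue 1 (mirror traffic), queue 2 (default)

# Both queues backlogged with 1000-byte packets (an arbitrary size).
busy = [deque([1000] * 50), deque([1000] * 950)]
mirror, other = service_round(busy, BYTE_COUNTS)
share = mirror / (mirror + other)

# Queue 2 empty: the round ends at once and queue 1 is serviced again.
idle = [deque([1000] * 50), deque()]
idle_round = service_round(idle, BYTE_COUNTS)

print(share, idle_round)
```

With both queues backlogged, the mirror queue receives 5000 of every 100000 bytes, the 5% minimum. When queue two is empty, nothing is sent for it and the scheduler returns to queue one immediately, so on an otherwise idle link the mirror traffic is not limited to 5%.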


Issues 


Four issues arose during the proof-of-concept 
implementation. The first was ensuring data integrity. 
What happens if the communications line is lost, pri- 
mary servers are lost, etc. during remote mirror syn- 
chronization? 


The SnapMirror product was verified to ensure
data integrity. Upon loss of network connectivity during
a mirror resynchronization, the remote filer data
remained as if the resynchronization had never begun.
However, an issue arose in ensuring database integrity.
This is a common problem when disk replication
occurs via block level writes. The problem is that the
database was open during the snapshot operation, so
transactions continued. Furthermore, the redo logs were
based from when the database last performed a “hot
backup” or was restarted, and thus could not be applied to
the backup. One way to accomplish database
synchronization is to shut down the database long
enough to take the snapshot. The database and the
redo logs are then restarted. In the case study
environment, the entire process takes about four minutes. Liu
and Browning [34] provide details covering the
backup and recovery process used. An alternative
would be to purchase a disk replication package
specific to the database.


The second issue was licensing. Several com- 
mercial products required for the DR functions use a 
licensing scheme based on the hostid of the primary 
system. The license key used for the primary installa- 
tion could not be used for the backup installation and 
proper additional licenses had to be obtained and 
installed.


The third issue arose in the QoS testing and
results from excess bandwidth and a reliance on
production data to test the design. The maintenance
network is lightly loaded during prime shift, from 6 am to
8 pm. The daily synchronization is initiated at 3 pm
and normally completes in about 75 minutes. Since the
maintenance network has excess bandwidth during the
synchronization, it is possible that the custom queuing
configuration has no effect, as the data indicate.


This demonstrates the point that the configura- 
tion of bandwidth reservation through IP QoS mecha- 
nisms is only required on the local LAN when the 
local LAN does not have excess bandwidth capacity 
greater than that of the low bandwidth connection 
used for backup site communications. In most cases, 
the bandwidth reservation is insurance that the 
required bandwidth will always be available enabling 
the low-bandwidth link to be fully utilized. The issue 
of excess bandwidth in the testing environment is fur- 
ther evidenced when a synchronization was attempted 
without traffic shaping enabled. Intuitively, the peak 
load induced when a transfer of greater than two GB is 
initiated through a five Mbps queue would slow the 
transfer down. However, it did not as queue one’s data 
continued to be serviced as long as there was no other 
network traffic. 


The resolution of this testing issue is more diffi- 
cult. In order to get valid test results for the quantity 
and types of changes, a production data volume was
used. This requires non-intrusive testing on the pro- 
duction LANs. While testing on the maintenance net- 
work during backups would be useful to this project, it 
may also prevent production work from completing 
and has not currently been undertaken. 


The final issue was security. Security of the 
remote servers, the remote mirror and communications 
is a topic that must be further addressed. In the test,
standard user authentication provides security on the 
remote servers and remote filers. Additionally, a con- 
figuration file, /etc/snapmirror.allow located on the 
primary filer, provides a security mechanism ensuring 
only the backup filers can replicate the volumes of the 
primary filers. The communication channels were over 
existing secure links. These secure links may not be 
available in the final target DR site. 


Data Evaluation 


The final task of the test is to determine the com- 
munication requirements to enable a stable geographi- 
cally distributed storage update policy over a low 
bandwidth link. Since not all the data in the Data Cen- 
ter could be used in the prototype due to storage limi- 
tations, a linear estimation was used to predict the 
time required to perform the data synchronization. The 
assumptions of the model are: 

• The amount of data changed is related to the
  amount of data in use.
• The amount of transferred data directly
  contributes to the length of time required for data
  synchronization.
• The relation between these two factors and the
  time required for data synchronization is linear.


Using the results from the sample data, 85% net- 
work efficiency allows a maximum of 13.84 GB of 
data to be transferred per 24 hours over a dedicated 
1.544 Mbps T1 link. Under the assumption of the 
3.69% maximum daily change rate, a maximum data 
store of 375 GB is supported by this design with a 
24-hour synchronization policy. Under the assumption 
of the 2.32% mean daily change rate, a maximum data 
store of 596 GB can be supported. In order to support 
the required one TB, a minimum of a 4.22 Mbps link 
is required for the maximum daily change rate and a 
minimum of a 2.65 Mbps link is required for the mean 
daily change rate. 
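

The linear estimate can be reproduced with a short calculation. The sketch below takes the 13.84 GB per day figure for a T1 link as given rather than re-deriving it, and assumes one terabyte is counted as 1024 GB, which is the reading under which the link speeds quoted above come out; the function and constant names are ours.

```python
T1_MBPS = 1.544           # dedicated T1 link
GB_PER_DAY_ON_T1 = 13.84  # figure reported above at 85% efficiency
ONE_TB_IN_GB = 1024       # assumption: 1 TB taken as 1024 GB


def max_data_store_gb(daily_change_rate, gb_per_day=GB_PER_DAY_ON_T1):
    """Largest data store whose daily changes fit in one day's transfer."""
    return gb_per_day / daily_change_rate


def required_link_mbps(store_gb, daily_change_rate):
    """Link speed needed, scaling the T1 result linearly."""
    changed_gb = store_gb * daily_change_rate
    return T1_MBPS * changed_gb / GB_PER_DAY_ON_T1


for label, rate in (("maximum", 0.0369), ("mean", 0.0232)):
    store = max_data_store_gb(rate)
    link = required_link_mbps(ONE_TB_IN_GB, rate)
    print(f"{label}: store {store:.1f} GB, link {link:.2f} Mbps")
```

Under these assumptions the model gives roughly 375 GB and 596 GB for the supported data store, and 4.22 Mbps and 2.65 Mbps for the one terabyte requirement.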


Summary and Future Work 


This paper proposes a design that is the integra- 
tion of several existing HA and network efficiency 
techniques, disk replication products and current IP 
QoS mechanisms, to establish an off-site DR facility 
over a low-bandwidth connection. The paper evaluates 
an approach to minimizing the communication 
requirements between the primary and backup site by 
relying on block level updates by the disk replication 
products to exploit file commonality in file updates; 
network traffic shaping and data shadowing to enable 
efficient network communications; and IP QoS
mechanisms to ensure that adequate bandwidth is available,
so that the low bandwidth link is used efficiently
and data synchronization can occur within the
constraints of the DR requirements.


The proof-of-concept test developed for the case 
study demonstrated the functionality of the design 
over a reasonably low bandwidth connection of five 
Mbps and also demonstrated that a dedicated T1 link 
was insufficient given a 24-hour update cycle of one 
TB of data with the derived set of usage parameters. 
The proof-of-concept also demonstrated several other 
points about the design. First, the gain from the 
exploitation of file commonality can be significant but 
is of course usage dependent. In general, data
replication products do not support block level updates, and if
they do, only within a single product line. A more generic
solution appears to be the exploitation of commonality 
at the file abstraction level where data compression 
and the integration of security mechanisms such as 
Internet X.509 [41] can be used for additional reduc- 
tions in required bandwidth and increased security. 
Secondly, traffic shaping the data was demonstrated to 
be a highly effective method to efficiently use the 
available communications on low-bandwidth links. 
Finally, as stated in the previous section, the testing of
the bandwidth guarantees is incomplete and difficult to
measure, and the guarantees are only required when
excess bandwidth is not available or, more generally
put, as insurance of available bandwidth. Bandwidth
reservations are most likely to be required when
communications are over a heavily used or bursty WAN.


The next steps for this project take two distinct
tracks. The first involves adding additional remote disk
capacity, securing an appropriate remote link with an
SLA of a minimum of five Mbps and testing additional
disk replication products to support the full data
set required at the DR site. The second is investigating
the feasibility of providing an enhancement that offers 
support for asynchronous mirroring of only the modi- 
fied areas of raw data in a compressed, secure manner, 
exploiting file commonality and further reducing 
bandwidth requirements. An enhancement or exten- 
sion to the Network Data Management Protocol 
(NDMP) is being explored as a possible solution.


Embracing and Extending Windows 2000


Jon Finke — Rensselaer Polytechnic Institute 


ABSTRACT 


We were recently presented with the challenge of deploying a large scale Windows 2000 
environment, initially for the Administration Division, but eventually including academic and 
other users. Rather than try to eventually re-integrate independently administered domains, we 
took this as an opportunity to develop the tools and resources to provide a campus-wide Windows 
2000 environment that is well integrated with the existing enterprise information and computing 
systems, much like we integrated our Unix systems. This would automate many of the mundane 
administrative functions, yet provide appropriate delegation of control to departmental
administrators as needed. This paper describes the systems we developed to make this happen.


Introduction 


Rensselaer recently embarked on a major build- 
ing initiative: a BioTech research center, an electronic 
media and performing arts center, a new central boiler 
and chiller plant, a parking garage, and a new campus 
entrance, and with this new construction come some 
new requirements for information sharing and archiv- 
ing. The Administration division decided to handle 
this with a Windows 2000 Exchange email system. 
The good news is that they came to the Chief Informa- 
tion Officer to ask for help. The bad news is that the 
CIO agreed to help. 


With this new initiative, we thought that it was 
very important to get this new system deployment 
right “the first time,” as attempting to go back later
and fix things would be very difficult. We also wanted 
to look beyond the requirements of this specific pro- 
ject, and deploy a campus-wide solution to integrate 
support of academic programs as well as other admin- 
istrative units. Our existing Windows administrators 
were spread over a number of administrative units, as 
well as a few academic departments. It was important 
to come up with a system that they would be comfort- 
able with, and one where they would be willing to 
“turn over control” to the central computing center. 
This was quite a change in approach for our Windows 
administrators.¹


At Rensselaer, we have a long history of 
automating our Unix systems administration tasks (via 
an Oracle database) and striving to get information 
from an authoritative source. Figure 1 gives a high 
level view of the flow of people-related information at 
Rensselaer. Human Resources and student record data 
flows from the administrative system running 
SCT/Banner into the Simon system’s [4, 5] Oracle 
database. One of the things this is used for is to
automatically create and expire Unix/email user accounts.
From Simon, the relevant information is sent to each
of the various client information and authentication
systems, including the central telephone directory
(LDAP, ph and web based), the ID card system, AFS
and Kerberos, and the new Windows 2000 domain and
Exchange email system.


¹ During some training on MS Exchange, we learned that
Microsoft invented Kerberos.


In addition, we have developed many tools and 
techniques that allow us to delegate responsibility with 
a great deal of fine-grained control. For example, our 
telephone directory system [6] allows a person from 
each department to maintain directory information for 
their staff. Our eventual proposal was actually built on 
top of several existing systems, with some enhancements
and extensions.


Figure 1: Information flow at Rensselaer.





One of our existing projects was to provide a 
comprehensive LDAP [9] directory service, and make 
this the directory service of choice for information
consumers. To this end, the data needs to be both accurate
and timely. Since it appeared that our LDAP
service would be driving our Active Directory service, this
project gained a key role in our Windows 2000 project. 


Another existing project was single signon, or at 
least some form of password synchronization between 
systems. While we may have a bunch of different sys- 
tems behind the scenes, our users think that they have 
a single account and password and we provide the 
magic to make it all work. To this end, we wrote a
web-based password changing tool. Unlike the one
developed at Auburn University [10], we use public 
key encryption to protect and store the passwords; and 
we use a relational database to manage and store both 
the public keys and the encrypted passwords, as well 
as to drive the actual password changing process on 
Windows, Oracle and Kerberos. 


In our original Windows 2000 proposal, the 
organization hierarchy would be based on the Univer- 
sity structure, as defined in the telephone directory. 
However, after meeting with some of the Windows 
administrators, we saw that we needed a way to allow 
these folks to create their own structure, yet keep them 
from interfering with other administrators. Fortu- 
nately, our existing telephone directory model pro- 
vided a good fit in both respects, and we decided to 
extend it to handle the Windows domain. 


The Institutional Layer 


While there are many technical challenges to 
implementing a system like this, institutional politics 
add a number of other “opportunities to excel.” In our 
case, we were fortunate that we had addressed many of 
these in the past, having long established the practice of 
feeding data from Human Resources and the Registrar 
into the Simon system to automatically create Unix and 
email accounts for everyone on campus. This same sys- 
tem was grown to manage the campus telephone direc- 
tory and feed the campus ID card system. With these 
projects in place, we had established our credibility in 
delivering campus-wide information services. 


One thing that helped us a great deal was always 
insisting on going to the authoritative source for all 
data elements. All student information comes from the 
Registrar, employee information from Human 
Resources, and so on. If we have changes to that data, 
we pass it back to those systems. Because we are not 
trying to maintain our own “student database” or
“employee database,” we can insist that the “owners”
of the data share it with us. We have been able to 
sweeten the pot in many cases by providing some ser- 
vices to them based on their data. 


As information systems evolve, we sometimes 
move away from this ideal because user demands may 
outpace the ability of support organizations to react to 
changes. As a result, we have discovered that several 
important data elements that should have been maintained
centrally had moved into the Simon system.²
We now have an ongoing task force monitoring the 
relationship between Simon and the rest of the admin- 
istrative information systems. 


Care and Feeding of LDAP 


We begin by describing the development of our
existing LDAP-related information flow, which provides
the base upon which the Windows implementation is
built. We needed a way to detect changes to data in our
enterprise information systems and propagate just those
changed records to LDAP. Since we wanted these
changes to be as close to “real time” as possible, the
process needed to be very lightweight. By making the
load on the enterprise database³ very small, we can
make frequent checks for changes without annoying the
DBAs or impacting performance on that system.


² If you are publishing the enterprise phone directory, you
WILL get good data feeds, and the rest of your projects can
benefit.


To make things more interesting, we had to do
this in such a way as to make minimal impact on the
keepers of the enterprise database. These MIS folks
have a great many demands on their time, and are often
not available to take a large role in the deployment of
new services. This necessitated that our approach
require very little involvement of the DBAs and
developers in the MIS department.⁴


Figure 2: Information change flow architecture.


In our initial LDAP deployment, we were
populating a general “people” directory following the
“eduPerson”⁵ [3] specification to replace our PH white pages
directory server. A second LDAP project was to provide a
“POSIX” account space to provide /etc/passwd to our
Unix hosts. We did this by generating LDIF records using
a PL/SQL package [13]⁶ and some interface code to
write these into files [7]. While generating LDIF
files to test the LDAP servers was okay, this was not
going to handle the ongoing updates that we needed.


Triggers and Change Queues 


A powerful feature of Oracle and other high end
databases is the ability to define triggers [1] that will
automatically execute when a particular table is
changed in some way. This gives us a way to insert our
own business rules in the general operation of the
database. For example, our code will get executed when
a person is added to the people table. What we did is
record key information about the change into a queue
table that would be processed later. Our intention here
was to place some of these triggers into our main
administrative database, SCT/Banner. In this way, we
would not need to change the vendor code, which
should make handling new releases much easier. For
applications we write ourselves, we have the option of
using triggers or putting these calls directly in the
applications themselves.


³ We run SCT/Banner for our student records, finance, and
payroll.

⁴ While this is a local condition, I suspect over-booked MIS
shops are a fact of life at many other sites.

⁵ This is an effort by EDUCAUSE to come up with a set of
standard attributes for directory entries at educational sites.

⁶ PL/SQL Packages are a collection of functions, procedures,
cursors, and variables. These can be both public and private.
Access to the public procedures can be granted to Oracle
users or roles. Once accessed in a session, a package
maintains state between calls.


In Figure 2, we have triggers “watching” for
changes to the Logins and the People tables. When a
change in a table is detected, we write a record into
the Meta_Change_Log table (see Table 1). We include a
number of identifying fields. The first three (Tname,
Subtype and Change_Type) identify which table, and
in some cases, which part of the table was changed
and how it was changed. We also have a number of
fields to identify which entry was changed. Since
many of the changes we are interested in involve
people, we include the primary keys used by each of our
two main information systems. However, we may be
looking for changes to something other than people
(department names, group membership, etc.) so we
have a generic character and numeric key available for
tables dealing with non “people” information. The
exact set of identifiers used depends on the table.


Name         Type      Description
Tname        Varchar2  Identifies the table that was changed.
Subtype      Varchar2  Optional subtype to allow specialized processing
                       based on which fields in a table were changed.
Rrowid       Rowid     The Rowid (Oracle record identifier) of the row
                       that was changed. This allows for direct access
                       to the desired row with no searching needed.
Change_Type  Varchar2  A flag indicating Insert, Modify or Delete of
                       the record.
PIDM         Number    Person identifier for our administrative system
                       (SCT/Banner).
Person_Id    Number    Person identifier for the Simon system.
Pkey_String  Varchar2  A table specific character string to identify
                       the record in question (such as a Unix account
                       name).
Pkey_Number  Number    A table specific numeric value to identify the
                       record in question (such as a Unix UID).

Table 1: Meta_Change_Log Table — Identification.


The second half of the Meta_Change_Log (see
Table 2) deals with how we process these records and
manage the queue. Each entry has an entry date and a
sequence number to assist with ordering. We are
feeding several systems with changes: our LDAP directory
server, our Windows 2000 Active Directory Server
and our Photo ID Card system. We need to be able to
process each of these queues independently, since they
may be operating with different schedules for backup,
maintenance, etc. Each of these queues has a “Process
Needed” flag and a process date. When a record is
inserted into the queue, the appropriate “Process
Needed” flags are set to “Y”.


Each of the _Proc flags has an index on it. When
a record is processed, this flag is set to Null. Oracle does
not index Null values, so only the active (change
pending) records are included in the index. This means
that, in normal operation, these indexes are very small
and can be accessed and updated very quickly. This
keeps the load on the database to a minimum and
allows us to make frequent checks to look for changed
records.
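

As an illustration of the trigger and change queue described above, the following sketch uses Python's built-in sqlite3 module rather than Oracle and PL/SQL. It is a heavily simplified stand-in, not our production schema: only one queue (LDAP) and a handful of columns are shown, and the table, trigger and index names are invented for the example. SQLite does index Null values, so a partial index is substituted here to get the same effect of indexing only the pending rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT);

-- Simplified stand-in for Meta_Change_Log (LDAP queue only).
CREATE TABLE meta_change_log (
    entry_number INTEGER PRIMARY KEY AUTOINCREMENT,
    tname        TEXT,
    change_type  TEXT,
    person_id    INTEGER,
    ldap_proc    TEXT          -- 'Y' while pending, NULL once processed
);

-- Only pending rows are indexed, so the index stays small.
CREATE INDEX ldap_pending ON meta_change_log (ldap_proc)
    WHERE ldap_proc IS NOT NULL;

-- The trigger only records the change; processing happens later.
CREATE TRIGGER people_insert AFTER INSERT ON people
BEGIN
    INSERT INTO meta_change_log (tname, change_type, person_id, ldap_proc)
    VALUES ('PEOPLE', 'I', NEW.person_id, 'Y');
END;
""")

db.execute("INSERT INTO people VALUES (1, 'Pat Example')")
db.execute("INSERT INTO people VALUES (2, 'Lee Example')")


def pending(conn):
    """Return the entry numbers still awaiting LDAP processing."""
    rows = conn.execute(
        "SELECT entry_number FROM meta_change_log "
        "WHERE ldap_proc = 'Y' ORDER BY entry_number")
    return [r[0] for r in rows]


before = pending(db)                      # both inserts are queued
db.execute("UPDATE meta_change_log SET ldap_proc = NULL "
           "WHERE entry_number = 1")      # acknowledge the first change
after = pending(db)
print(before, after)
```

The application inserting into the people table knows nothing about the queue, which is the property that lets triggers be added without changing vendor code.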


Name              Type      Description
Entry_Date        Date      The time and date when the record was
                            entered.
Entry_Number      Number    An ever increasing sequence number.
LDAP_Proc_Date    Date      The date when the LDAP SyncLing processed
                            this record.
LDAP_Proc         Varchar2  A flag set to “Y” when this record is
                            awaiting processing by LDAP.
ADSI_Proc_Date    Date      The date when the ADSI synch program
                            processed this record.
ADSI_Proc         Varchar2  A flag set to “Y” when this record is
                            awaiting processing by ADSI.
IDCARD_Proc_Date  Date      The date when the IDCARD synch program
                            processed this record.
IDCARD_Proc       Varchar2  A flag set to “Y” when this record is
                            awaiting processing by IDCARD.

Table 2: Meta_Change_Log Table — Queue processing.


SyncLings

Now that we had a list of changes that needed 
processing, we also needed a way to get them into 
LDAP. To this end, we wrote an Oracle package called 
LDAP_Sync. This package references the Meta_Change_Log
table and provides two procedures, Get_Changes
and Ack_Change. The LDAP_Sync package is called by
a Java program called a “SyncLing.” It calls the 
Get_Changes routine to get the next change, processes 
it, and then marks that change as done with the 
Ack_Change routine. 


The Get_Changes routine (see Table 3) is called,
and the next LDIF record [8]⁷ is returned. This is then
compared to the existing record in the LDAP server by
the SyncLing, and the appropriate action is taken. As
each record is processed, the Ack_Change routine is
called with the Rec_Id. This marks the record as
processed (clearing the LDAP_Proc flag in the table).
This process is repeated until Get_Changes returns a
null record. At this point, the SyncLing pauses for a
few seconds, and then resumes asking.
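

The polling loop just described can be sketched as follows. The real SyncLing is a Java program calling the PL/SQL package; this Python version is only an outline of the control flow, with an in-memory class standing in for the database package. The class, the max_idle_polls parameter (added so the example terminates) and the sample LDIF strings are all hypothetical.

```python
import time


class ChangeQueue:
    """In-memory stand-in for the LDAP_Sync package (hypothetical)."""

    def __init__(self, records):
        # rec_id -> LDIF text; insertion order is processing order
        self.pending = dict(records)

    def get_changes(self):
        """Return (rec_id, ldif) for the next pending change, or None."""
        for rec_id, ldif in self.pending.items():
            return rec_id, ldif
        return None

    def ack_change(self, rec_id):
        """Mark a change as processed so it is not returned again."""
        self.pending.pop(rec_id, None)


def run_syncling(queue, apply_change, pause=5.0, max_idle_polls=1):
    """Process changes until the queue has been idle long enough."""
    idle = 0
    while idle < max_idle_polls:
        change = queue.get_changes()
        if change is None:
            idle += 1
            time.sleep(pause)      # nothing to do; wait before asking again
            continue
        idle = 0
        rec_id, ldif = change
        apply_change(ldif)         # compare with and update the directory
        queue.ack_change(rec_id)   # acknowledged only after it is applied


applied = []
queue = ChangeQueue({
    101: "dn: uid=pat,ou=people,dc=example,dc=edu",
    102: "dn: uid=lee,ou=people,dc=example,dc=edu",
})
run_syncling(queue, applied.append, pause=0.0)
print(applied)
```

Acknowledging a change only after it has been applied means a SyncLing that stops part way through will be handed the same record again on restart.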


As part of the process of the initial data load of 
the LDAP server, we had already written a package to 
generate LDIF records for each person at Rensselaer 
from the directory database, so it was a trivial matter 
to call this routine from the Get_Changes routine. 


We are actually populating several “trees” in our 
LDAP database: a general people tree, and a POSIX 
account tree (in part, to replace our /etc/passwd file and 
some group files). There are corresponding routines to 
generate flat LDIF files that we can call for each of 
these cases. The type field identifies the type of LDIF 
record being returned, so that the processing program 
knows what to do. It can also ask for just records of a 
given type, or it can ask for all types of records. 


This system works very well for our Unix sys- 
tems for which it is deployed. In the next section, we 
discuss how we extended it to service our Windows 
2000 systems. 


Feeding of Active Directory 


Our initial plan was to use our LDAP server to
feed the Active Directory⁸ server. However, this ran
into a few problems. Although we were feeding both a
POSIX user id base and the general person/directory
information into LDAP in essentially real time, the
data did not match well with the requirements of
Active Directory and our Exchange server. Although
we were able to load people into Active Directory via
LDAP, we ran into limitations due to our focus on the
“eduPerson”; fields we needed in Active Directory
were not available.


Name         Type      In/Out  Description
RType        Varchar2  In/Out  On call, specifies the type of records
                               desired, and on return, indicates the
                               type of record being returned.
Rec_Id       Varchar2  Out     Returns a record id used for the
                               Ack_Change routine.
LDIF_Record  Varchar2  Out     An LDIF formatted record with the
                               information for the next person to be
                               processed.

Table 3: Get_Changes parameters.


⁷ One method of loading data into an LDAP server is using
a format known as the “LDAP Data Interchange Format” or
LDIF.

⁸ Active Directory is the database server used by Windows
2000 to hold user information and passwords.


The second problem is that we needed to propa- 
gate password changes from our general system into 
the Windows 2000 world. Our password changing 
scheme (see section below) relies on public key 
encryption to secure the passwords while in transit; 
implementing this via LDAP was not practical. 


We had recognized early in the project that 
LDAP was not going to be able to handle the pass- 
word changing, so we had started a second project to 
manage the passwords. At that time, Microsoft was 
moving to the use of Java in some aspects of their 
systems, so we started development of a Java 
program that would talk to the database, get the password 
changes, and then apply them to Active Directory. 
When it became clear that our LDAP approach was 
not going to handle our needs, we expanded the role of 
the password propagation to a more general Active 
Directory update system. 


The database side of the LDAP solution described 
above was exactly what we needed, so we simply 
cloned it. In fact, we used the same table, and simply added 
an ADSI (Active Directory Service Interface) [11] 
propagation flag to duplicate the LDAP propagation 
flag, and duplicated the PL/SQL package to provide 
ADSI with its own interface to the database. This 
allows for ADSI and LDAP to run in parallel. 
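
To show why one flag per consumer lets the two feeds run in parallel, here is a minimal sketch using Python's built-in sqlite3 module in place of Oracle. The table name change_queue and the column names ldap_prop_pending and adsi_prop_pending are hypothetical; the paper does not give the real names. The point is only that each consumer reads and clears its own flag on the shared row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE change_queue (
           username          TEXT PRIMARY KEY,
           ldap_prop_pending TEXT NOT NULL DEFAULT 'Y',
           adsi_prop_pending TEXT NOT NULL DEFAULT 'Y'
       )"""
)
conn.executemany(
    "INSERT INTO change_queue (username) VALUES (?)",
    [("alice",), ("bob",)],
)

# Column names are fixed in code, never taken from user input.
FLAG_COLUMNS = {"ldap": "ldap_prop_pending", "adsi": "adsi_prop_pending"}


def get_changes(consumer):
    """Return the next username still pending for this consumer, or None."""
    col = FLAG_COLUMNS[consumer]
    row = conn.execute(
        f"SELECT username FROM change_queue WHERE {col} = 'Y' "
        "ORDER BY username LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def ack_changes(consumer, username):
    """Clear only this consumer's flag; the other consumer is unaffected."""
    col = FLAG_COLUMNS[consumer]
    conn.execute(
        f"UPDATE change_queue SET {col} = 'N' WHERE username = ?",
        (username,),
    )
```

Because acknowledging a change for one consumer never touches the other flag, a slow or stopped consumer does not hold up the other one.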


The ADSI_Sync package starts out much like the 
LDAP_Sync package, with Get_Changes and Ack_Changes 
routines that operate in the same way as 
described above. However, rather than returning an 
LDIF record, we just return the username. We have also 
added two more routines, Get_Dirinfo (see Table 4) and 
Get_Dirinfo2 (see Table 5). The rollout of this service 
was under some pretty tight time pressure, which is 
why these are two separate routines, rather than just 
one. 


As the Windows 2000 service was being rolled 
out, and more administrative users were being added 
to our Exchange server, some other issues were 
discovered. We were already using the Windows 2000 
user base and password for students to access our 







Table 3: Get_Changes procedure parameters. 

public workstation labs, so the password changing 
system was being well exercised, but as administrative 
users were moving their email from our Unix-based 
POP mail service to the Exchange server, we needed to 
feed more information into Active Directory. 


One of the other objectives of this project was to 
provide Exchange based mailing lists for all the peo- 
ple in a given department or division. Our first pass at 
this was to provide a pair of routines, Get_Divisions and 
Username_By_Division; the first would return a list of 
all of the divisions, and the second would return a list 
of everyone in the specified division. In the same way, 
we also provided Get_Departments and 
Username_By_Department routines. These would be used to 
maintain the desired membership lists. However, we 
quickly ran into a problem with the names being used; 
different aspects of the Active Directory service took 
exception to some of the special characters we were 
using, such as ampersand, dash, and a few others. To 
handle this, we added a CLEAN flag to the Get_D... 
procedures and a Get_Username_By_Clean_D... call. 
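
The effect of a CLEAN flag can be sketched as follows. The paper does not list the actual substitution rules, so the ones below (ampersand becomes "and", other punctuation becomes a space) are assumptions chosen for illustration, and clean_name and get_divisions are hypothetical Python helpers rather than the PL/SQL routines.

```python
import re


def clean_name(name):
    """Return a copy of name with troublesome characters removed.

    Assumed rules (not from the paper): '&' becomes 'and', any other
    character that is not a letter, digit, or space becomes a space,
    and runs of whitespace collapse to a single space.
    """
    text = name.replace("&", " and ")
    text = re.sub(r"[^A-Za-z0-9 ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def get_divisions(divisions, clean=False):
    """Mimic a Get_Divisions call with an optional CLEAN flag."""
    return [clean_name(d) if clean else d for d in divisions]
```

Keeping the original name as the default and cleaning only on request means existing callers see no change, which matches the idea of adding a flag rather than altering the stored names.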


Once our users had tasted the wonders of auto- 
matically maintained mailing lists, they were hungry 
for more. As part of a different project, we had put 
together some special mailing lists for our ListProc 
machine such as “Deans, Directors and Department 
Heads” and “Building Coordinators.” These lists 
were being dumped as flat files using our Generate_File 
program. A little bit of creative coding, and these lists 
became available to ADSI via the Get_Specials and 
Get_Username_By_Special routines. 


To be Sharp, You Must C Sharp 


When we first started working on a Java 
program to change passwords, it appeared that Microsoft 
would provide (Java) class libraries which exposed 
their ADSI routines. As it turns out, they ended up 
abandoning their Java efforts, and we were forced to 
provide our own Java-callable ADSI routines by 
using JNI (Java Native Interface) to wrap a Windows 
DLL written in C. Although it worked, it was very 
difficult to change, and our development staff dreaded 
new requirements from the Windows team. Every new 
function would take 30 minutes to write, and then four 
days to debug and tweak so it would actually work. 


Microsoft’s new direction was a programming 
language called C# [12], part of their Visual Studio pack- 
age. With C#, Microsoft has provided the ADSI class 
libraries we need, allowing us to eliminate the need for 
a custom DLL. This has proven much easier to work 
with, and we have resumed adding new fields and 
streams into Active Directory from our central 
database. This shows up as the C# ADSI box in Figure 2. 


Making Changes to Passwords; Here, There and 
Everywhere 


When we first rolled out our campus-wide Unix 
service, built on top of an AFS filestore, we wrote a 
replacement for the general Unix passwd program that 
would update the Kerberos password and save a copy 
of the Unix PASSWORD crypt in our central 
database. This enabled us to build conventional 
/etc/passwd files for a few legacy systems that did not 
use Kerberos (this was 10 years ago). 


As our systems evolved, many of our users 
moved away from the Unix workstations for their 
computing, but continued to use their Kerberos pass- 
words for email, printing, dial-up and other services. 
Instructing people to connect to a Unix machine 
(using ssh, not telnet!), sign on, and invoke the passwd 
program to change their password resulted in a lot of 
frustration for both our users and our help desk staff. 


Name          Type      Description
Uname         Varchar2  Target username, provided by Get_Changes or other routines.
Pref_Email    Varchar2  Preferred email address.
Camp_Phone    Varchar2  Campus telephone number.
Camp_Fax      Varchar2  Campus fax number.
Camp_Address  Varchar2  Campus address.
Department    Varchar2  Department name.
Division      Varchar2  Division name.
Web_Page      Varchar2  URL of person’s web page, if available.
Title         Varchar2  Title of person (employees only, not students).

Table 4: Get_Dirinfo procedure parameters. 





Name                  Type      Description
Uname                 Varchar2  Target username, provided by Get_Changes or other routines.
Last_Name             Varchar2  Person’s last name.
First_Names           Varchar2  Person’s first and middle names.
Preferred_First_Name  Varchar2  Alternate first name preferred by the person. Used in
                                the directory.

Table 5: Get_Dirinfo2 procedure parameters. 

A web based password changing system was an 
obvious solution. 


Although we had long ago stopped collecting a 
Unix crypt, we still liked the idea of collecting pass- 
words for use on other systems, such as a trouble tick- 
eting system (Oracle based), our academic Oracle 
servers, and our Kerberos 5 server. We were not com- 
fortable with the idea of storing, or even transmitting, 
clear text passwords. While we could protect the web 
session with SSL, we still had the traffic from the web 
server to the database server. So we decided to use 
public key encryption: we encrypt the plain text password 
with a public key on the web server and transmit the 
encrypted text to the database for processing and 
storage. Since we were already connecting to the database 
for other reasons, this was a logical spot to store the 
public keys. As part of this process, the password 
change web page also signals (via Oracle signals) [2] 
the back end password processing. 


Using Oracle to broker the password changes 
makes it much easier to add new authentication ser- 
vices. The web page used by the users does not need 
to change, nor does it need to understand every new 
type of authentication system. To add a new authenti- 
cation system, we just need to write a new back end 
that understands that world, along with some public 
key encryption and database access. It also allows for 
people to request password changes when some 
authentication services are not available; the change is 
held until the service is restored. 


Figure 3 shows the data flow for a password 
change. We start with a web page, connecting to our 
secure web server via an SSL connection (step 1). The 
secure web server does some password strength 
checks, and, if things are okay, makes an immediate 
password change to our AFS Kerberos server (step K). 


Finke 


There is also an option to simply test a new password 
to ensure that it will pass the strength tests. 


Once the password change is checked out, the 
web server calls a packaged routine in the database, 
Get_Public_Key (step 2). This public key is used to 
encrypt the clear text of the user’s new password, and 
then that encrypted text is stored back in the database 
(step 3) via the Store_Pw procedure. This stores the 
encrypted text, the user name, and the key number in 
the Encrypted_Passwd_Cache table (Table 6). 








[Figure 3: Password change flow. Surviving labels: Browser, AFS Server, Windows Controllers, Encrypted Password Cache.] 


Using the same sort of propagation polling that 
we used for the ADSI_Sync process, we have a Windows 
Password Changer program running on our Windows 
2000 Password Master machine. It looks for entries in 
the Encrypted_Passwd_Cache table (step 4) that have 
Windows_Prop_Pending set to “Y.” Along with the 
principal name and the encrypted password, it also 
gets a key number. It then consults its own list of 
private keys (step 5) and uses that to decrypt the 


Name                  Type      Size  Description
Principal_Name        Varchar2  32    The user name.
Encrypted_Key         Varchar2  1024  The encrypted user password.
Key_Num               Number          The key number used to encrypt the password.
Entry_Date            Date            The time and date of the change request.
Simon_Prop_Pending    Varchar2        A flag indicating that this password needs to go to Simon.
Simon_Prop_Date       Date            The time and date when this change was given to Simon.
Simon_Err_Code        Number          A numeric error code from this operation.
Applix_Prop_Pending   Varchar2        A flag indicating that this password needs to go to Applix.
Applix_Prop_Date      Date            The time and date when this change was given to Applix.
Applix_Err_Code       Number          A numeric error code from this operation.
Kerb5_Prop_Pending    Varchar2        A flag indicating that this password needs to go to the Kerberos version 5 server.
Kerb5_Prop_Date       Date            The time and date when this change was given to the Kerberos version 5 server.
Kerb5_Err_Code        Number          A numeric error code from this operation.
Windows_Prop_Pending  Varchar2        A flag indicating that this password needs to go to Windows.
Windows_Prop_Date     Date            The time and date when this change was given to Windows.
Windows_Err_Code      Number          A numeric error code from this operation.

Table 6: Encrypted_Passwd_Cache table. 

password and make the change on the Windows 2000 
domain controller (step 6). Once that has been done, it 
then updates the Windows_Prop_Pending flag (step 7), 
as well as the Windows_Prop_Date field. It continues 
getting more rows to process until there are no more to 
be done. At that point, it will “sleep” for 30 seconds, 
and then check again. While this seems pretty quick, 
there is some delay in propagating the password 
changes between the Windows Domain Controllers, so 
it can take up to 15 minutes for the changes to get 
passed among the Windows servers. We have not 
found any ways in which to improve this. 
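
Steps 2 through 7 can be traced in a small, self-contained sketch. Everything concrete in it is an assumption: the paper does not name the encryption algorithm, so the example uses textbook RSA with tiny toy numbers (no padding, trivially breakable, for illustration only), a Python list stands in for the Encrypted_Passwd_Cache table, and the function names merely echo Get_Public_Key and Store_Pw. A real deployment would use a vetted cryptographic library.

```python
# Toy RSA numbers (p=61, q=53): illustration only, trivially breakable.
PUBLIC_KEYS = {1: (3233, 17)}     # key_num -> (n, e), kept in the database
PRIVATE_KEYS = {1: (3233, 2753)}  # key_num -> (n, d), kept on the changer host

cache = []  # stand-in for the Encrypted_Passwd_Cache table


def get_public_key():
    """Step 2: hand the web server a key number and its public key."""
    key_num = max(PUBLIC_KEYS)
    return key_num, PUBLIC_KEYS[key_num]


def store_pw(principal, password):
    """Steps 2-3: encrypt on the web server, store only the ciphertext."""
    key_num, (n, e) = get_public_key()
    encrypted = [pow(b, e, n) for b in password.encode("utf-8")]
    cache.append({
        "principal_name": principal,
        "encrypted_key": encrypted,
        "key_num": key_num,
        "windows_prop_pending": "Y",
    })


def process_pending(set_password):
    """Steps 4-7: decrypt each pending row, apply it, clear the flag."""
    done = 0
    for row in cache:
        if row["windows_prop_pending"] != "Y":
            continue
        n, d = PRIVATE_KEYS[row["key_num"]]  # step 5: pick key by number
        clear = bytes(pow(c, d, n) for c in row["encrypted_key"])
        set_password(row["principal_name"], clear.decode("utf-8"))  # step 6
        row["windows_prop_pending"] = "N"    # step 7
        done += 1
    return done
```

A long-running changer would call process_pending in a loop and sleep between passes, as the text describes. Storing the key number with each row is what lets the back end pick the matching private key after keys have been rotated.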


Key Generation 


There are a number of options and approaches to 
managing the public/private key pairs. For our Win- 
dows 2000 world, we have a key generator program 
that we run on the Windows 2000 Password Master 
machine which generates a key pair. The private key is 
stored in a secure location on the Password Master 
machine (and yes, we really need to keep this machine 
secure), and the public key is stored on the Oracle 
database server. This makes it available to the Secure 
Web Server when needed. The public keys are stored 
in the Passwd_Key_List table (Table 7). 


We have not spent a lot of time working on our 
key management procedures, but we periodically gen- 
erate new key pairs and remove the old private keys 
from service. The use of key numbers and types helps 
keep things in order for us. However, since all of the 
applications (the secure web server and the back end 
processors) rely on the Oracle procedures for key 
management, we can change those routines and the 
external code will do the right thing without any 
changes. 
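
As an illustration of how key numbers, types, and the active flag can keep rotation orderly, the sketch below models a few rows shaped like the Passwd_Key_List table. The sample rows, dates, and the helper names active_key and rotate_key are hypothetical, and the public key columns (PK_N, PK_E) are omitted for brevity; this is not the Oracle package itself.

```python
from datetime import date

# Hypothetical rows shaped like the Passwd_Key_List table.
key_list = [
    {"key_num": 1, "key_type": "windows", "active": "Y",
     "start_date": date(2002, 1, 1), "end_date": None},
    {"key_num": 2, "key_type": "simon", "active": "Y",
     "start_date": date(2002, 1, 1), "end_date": None},
]


def active_key(key_type):
    """Return the single active key row for a key type."""
    matches = [k for k in key_list
               if k["key_type"] == key_type and k["active"] == "Y"]
    if len(matches) != 1:
        raise LookupError(f"expected one active {key_type} key, "
                          f"found {len(matches)}")
    return matches[0]


def rotate_key(key_type, today):
    """Retire the active key of this type and register a new one."""
    old = active_key(key_type)
    old["active"], old["end_date"] = "N", today
    new = {"key_num": max(k["key_num"] for k in key_list) + 1,
           "key_type": key_type, "active": "Y",
           "start_date": today, "end_date": None}
    key_list.append(new)
    return new["key_num"]
```

Because callers only ever ask for the active key of a type, the rotation policy can change behind these two functions without touching the web server or the back end processors.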


Multiple Streams 


In the case discussed above, we are keeping two 
passwords in sync: our AFS Kerberos password and the 
one for our Windows 2000 server. In actuality, we are 
also keeping in sync the Oracle password on the Simon 
server and the Oracle password on our trouble ticketing 
system (Applix). We are also maintaining our Kerberos 
version 5 server passwords via this approach. 


When the secure web server encrypts a 
password, it actually does it twice, once with a key used 
for the Windows Password Changer, and a second 
time using a different key pair used by the Simon 
system. So, each password change results in two entries 
being made in the Encrypted_Passwd_Cache table. 
When the Windows key is used, the 
Windows_Prop_Pending flag is set, and when the Simon key is used, 
the Simon_Prop_Pending, Applix_Prop_Pending and 
Kerb5_Prop_Pending flags are set. Since these three 
queues are all processed on the Simon server machine, 
they are able to share the same public and private 
keys. 
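
The mapping from key type to propagation flags can be written down directly. In this sketch the flag names follow Table 6 (in lower case), while the function rows_for_change, the encrypt callback, and the key numbers passed in are hypothetical stand-ins for the real web server code.

```python
# Which propagation flags each key type drives (per the text above).
FLAGS_BY_KEY_TYPE = {
    "windows": ["windows_prop_pending"],
    "simon": ["simon_prop_pending", "applix_prop_pending",
              "kerb5_prop_pending"],
}
ALL_FLAGS = [f for flags in FLAGS_BY_KEY_TYPE.values() for f in flags]


def rows_for_change(principal, encrypt, key_nums):
    """Build one cache row per key type for a single password change.

    encrypt(key_type) returns the ciphertext under that type's public
    key; key_nums maps key type to the key number that was used.
    """
    rows = []
    for key_type, flags in FLAGS_BY_KEY_TYPE.items():
        row = {
            "principal_name": principal,
            "encrypted_key": encrypt(key_type),
            "key_num": key_nums[key_type],
        }
        # Only the flags served by this key are set; the rest stay 'N'.
        row.update({f: ("Y" if f in flags else "N") for f in ALL_FLAGS})
        rows.append(row)
    return rows
```

Each back end then only looks at rows where its own flag is "Y", so it never sees ciphertext it cannot decrypt.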


At present, our Kerberos 5 server is not yet in 
production. By using its own propagation flag, 
changes to other servers can still go through when the 
Kerberos 5 server is not available. When service is 
restored, it can catch up on the changes. We also have 
the ability to re-encrypt the clear text to feed to other 
systems. This allows us to keep most of the processing 
on the back end server, rather than making additional 
encryption runs on the secure web server. 


Timing Issues 


Although we would like all of this to take place 
instantly, the back end program is doing decryption 
and re-encryption, as well as accessing the database. 
In addition, the program is actually run by another job 
scheduling system that is sometimes busy with other 
functions (creating accounts, changing quotas, etc.). 
We also have the problem that the system we are try- 
ing to update is sometimes down or unreachable, so 
the password change is stuck waiting in the queue. 
This was starting to result in some operational prob- 
lems where a person would change their password, 
and then immediately try to access a service and get 
an authentication failure. 


To assist our help desk (and advanced users), we 
wrote another web page which can display the exact 
date and time of a person’s last password change, and 
when it was applied to each authentication base. It will 
also display if there is a change pending. This also lets 
us generate some statistics on the time it takes for 
password changes to propagate. The page will first 
report on any pending changes, and then report on the 
most recent set of completed changes. It produces a 
report like Table 8. 
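
The arithmetic behind such a report is simple: for each authentication base, subtract the entry date of the change request from the date it was applied. The sketch below assumes the dates have already been fetched; the function name propagation_report and the service labels in the usage are hypothetical.

```python
from datetime import datetime


def propagation_report(entry_date, prop_dates):
    """Summarise one password change.

    prop_dates maps a service name to the datetime the change was
    applied there, or None if it is still pending.
    Returns (pending_services, {service: elapsed_seconds}).
    """
    pending = sorted(s for s, when in prop_dates.items() if when is None)
    elapsed = {
        s: int((when - entry_date).total_seconds())
        for s, when in prop_dates.items() if when is not None
    }
    return pending, elapsed
```

For example, a change entered at 09:00:00 and applied to one service at 09:00:42 yields an elapsed time of 42 seconds for that service, while a service with no date yet is listed as pending.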


Name         Type      Description
Key_Num      Number    The key number.
PK_N         Varchar2  Part of the public key.
PK_E         Varchar2  The other part of the public key.
Active       Varchar2  A flag indicating that this key is the active key for
                       this type.
Start_Date   Date      When a key goes into service.
End_Date     Date      When a key is removed from service.
Create_Date  Date      When a key was created.
Key_Type     Varchar2  What “TYPE” of key this is; used to identify who has
                       the private key that matches this key.

Table 7: Passwd_Key_List table. 


2002 LISA XVI — November 3-8, 2002 — Philadelphia, PA 







Embracing and Extending Windows 2000 


There are a couple of oddball entries. For exam- 
ple, the Kerb 4 change always has an elapsed time of 0 
seconds: if it fails, none of the Oracle processing will 
take place, and no logs will be written for display. Of 
course, the user will get a very clear error message on 
the web page when this happens. The other odd case is 
the AdminReg. This is a case where the user lost their 
password and went to the help desk and asked that 
their password be reset. This system uses a similar 
mechanism to the ones discussed here, and sometimes 
it is subject to delays. 


Other password issues 


For long term storage of “clear-text” passwords, 
in order to let us populate services yet to be deployed, 
we have the option of storing the appropriate private 
keys on secure storage (such as a floppy disk locked 
up in a safe), and when we need to load the initial 
population of a new service, we get the floppy, restore the 
private keys, and load the new system. 


Systems Administration as a White Pages problem 


At LISA X, I presented a paper [6] showing how 
managing the University white pages (phone direc- 
tory) was really a Systems Administration problem. 
Things have now come full circle, and we are apply- 
ing our telephone directory tools and techniques to 
managing our Windows 2000 domains. 


For our telephone directory, we take a data feed 
from Human Resources and decide (based on status and 
department) whether each person is included in the 
directory, or needs to be moved to another part of the 
directory tree. We also have facilities available to manually 
move someone to a different department, or to have 
someone appear in a second or any number of departments. 
What is more, our tools allow us to delegate control of 
any department or subtree to folks outside of the com- 
puter center. 


This is the same problem we wanted to solve 
with Windows 2000 user accounts. We wanted them to 
be created and expired automatically, we wanted new 
accounts to be put in the appropriate domains by 
default, and we wanted to allow our Windows 2000 
administrators to have some ability to shuffle folks 
around, but not give every admin the ability to move 


9. Our baseline organization structure is based on our 
financial accounting system. We modify it slightly to reflect the 
actual organizational structure. 

every account. The tools used for our telephone 
directory could be used for this with almost no changes. 


Futures 


Having the common Windows/Unix/Kerberos 
account and password base is very nice, in that it allows 
us to deploy new services requiring authentication and 
have some options with how we do it. Since all of the 
IDs and passwords are the same, the users do not know 
or even care how it is done. At present, we require the 
users to change their password in order to access Win- 
dows 2000 and some other services. We like this from a 
security perspective, as the initial password is stored in 
clear text and may have been printed. On the other 
hand, this creates an extra step for new users. However, 
it appears that user convenience has won out over secu- 
rity, and we will soon be removing the requirement to 
change the initial password. 


We do have a web based account pickup system 
in place that does put new users right on the password 
change web page once they get their initial password. 
Many of the new users take that opportunity to change 
their passwords. Of the 1596 new users who used the 
web pickup tool, 1467 (92%) changed their passwords. 


We will continue to add new fields into Active 
Directory from our enterprise information systems. 
Now that the tools are in place and understood, it is 
much easier to add new things. 


Problems and Lessons 


One problem is our general job queuing mecha- 
nism, so we will be investigating adding a special 
password change thread or simply multiple threads to 
keep password changes from getting stuck behind 
other jobs. A priority setting for jobs might be enough, 
since, in general, any given job is pretty quick. The 
problems arise when a password change gets stuck 
behind 1200 user account creation requests. 
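
A priority setting of the kind mentioned above can be sketched with Python's heapq module. This is an illustration of the idea, not our job scheduler: the job type names and priority values are assumptions, and a counter is used as a tie-breaker so that jobs of equal priority still run in submission order.

```python
import heapq
import itertools

PRIORITY = {"password_change": 0, "account_create": 5}  # lower runs first
_counter = itertools.count()  # tie-breaker keeps FIFO order per priority
_queue = []


def submit(job_type, payload):
    """Queue a job; its priority is looked up from its type."""
    heapq.heappush(
        _queue, (PRIORITY[job_type], next(_counter), job_type, payload)
    )


def next_job():
    """Pop the most urgent job, or return None if the queue is empty."""
    if not _queue:
        return None
    _, _, job_type, payload = heapq.heappop(_queue)
    return job_type, payload
```

With this arrangement a password change submitted after 1200 account creation requests is still the next job handed out, and the account creations keep their original order.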


Working with public key encryption can be very 
tricky and we have run into problems with how keys 
are generated and stored. We have two different ways 
of generating key pairs, and although both store the 
public keys in Oracle, and we can successfully encrypt 
with either key, the private keys were not interchange- 
able. For example, if I generated a private key under 
AIX, it would not work in Windows. We have subse- 
quently learned why this was, and have fixed this. We 












Table 8: Password status sample. 

have also hit problems with packages working under 
one version of AIX on a particular type of processor, 
and, when recompiled on a different machine, not 
working at all. Some of the packages we are using 
appear to have some buffer overruns and other 
memory layout issues. 

A Big Hammer 


The requirement for a high end database server 
and some of the other infrastructure may make this 
solution “too big” for some small sites to contemplate. 
However, I feel that part of this is a matter of perception. 
When we first embarked on the Simon project, using 
Oracle added a big expense for software and hardware, 
and a lot of development effort. On the other hand, the 
challenge of trying to deliver an enterprise wide com- 
puting environment was simply too big to accomplish 
without using commercial support. This project will not 
help a small operation; it is simply too big a hammer. 
But if you are working with tens of thousands of user 
accounts, and a large turnover in population, you have 
no other choice but to go for the big tools. 


While the Simon system is involved in many 
facets of our information systems, one of the most 
important roles it plays is as an interface layer 
between different vendor applications. It gives a place 
to apply our business rules and needs to the areas 
where the vendor products fall short. Part of the cost 
of doing this is having a development staff who can 
make these systems work together. 


While it is certainly nice to think that we can do 
everything with free software, I do not feel that this is 
a realistic approach to a large scale system. At our site, 
we spend close to one million dollars a year in soft- 
ware licensing and support contracts. Software license 
costs are simply a cost of doing business. We don’t 
seem to have a problem obtaining four servers for our 
LDAP directories, so a database server shouldn’t be 
any different. 


References and Availability 


All source code for the Simon system is avail- 
able on the web. See http://www.rpi.edu/campus/ 
rpi/simon/README.simon for details. In addition, all 
of the Oracle table definitions as well as PL/SQL 
package source are available at http://www.rpi.edu/ 
campus/rpi/simon/misc/Tables/simon.Index.html. We 
can make the ADSI (C#) routines available as well. It 
seems unlikely that we can distribute the public key 
encryption routines used in the password management, 
but presumably you can find them elsewhere. 


ABSTRACT 


Stem is a system administration “enabler.” It is not an administrative tool, but rather a 
general-purpose development framework that allows an administrator to craft tools to perform a 
wide variety of tasks in a distributed environment easily and quickly. Many common tasks can be 
performed with Stem scripts involving a few lines of declarations. Current example applications 
include log file collating and service load balancing. Using Stem, a non-programmer can craft 
reliable network software in a few lines of declarations that would require hundreds or thousands 
of lines of code in a traditional programming language such as C. 


Introduction 


It is a bit difficult to begin describing something 
that is rather unique. Stem is not an administrative 
tool, nor does it support a particular kind of adminis- 
trative practice. It is instead a “framework” for 
declarative construction of networked software for a 
variety of purposes. Its intent is to make network soft- 
ware that is as easy to create as it is to describe. In 
many cases, Stem allows a system administrator to 
create such tools without learning to program by bas- 
ing them upon a few templates that describe common 
network applications. 


Stem’s components include: 

• A declarative configuration language by which 
one can define application components, known 
as “Cells.” Stem is written in Perl, and the 
current configuration directives use Perl syntax. 

• A runtime daemon that creates and executes 
components (cells) based on the defined 
configuration. 

• A set of modules that implement useful, 
commonly used cells. 

• Additional utility modules and command tools. 


While it remains a general-purpose programming 
framework, Stem’s primary goal is to help system and 
network administrators solve their problems with less 
work. This is the concept of “enabling” which will be 
a theme throughout this paper. Stem enables system 
administrators to easily glue existing systems together 
with network connections, create new networked 
applications with much less coding than before, and 
configure solutions to many common problems with- 
out any coding at all. 


Stem is a general purpose system that can be 
used in many situations and problem spaces. By using 
the existing Stem modules and example configurations 
you can focus on your own problem space and issues 
and ignore many common network programming 
problems such as event handling and client-server and 
other interprocess communication. If your problem 
space is large, Stem enables you to cover that entire 
space under one architecture, thereby simplifying your 
design, coding and maintenance. 


To Code or Not to Code 


While most administrators write at least the 
occasional script, many (perhaps even most) do not 
have the time or skills to develop full-blown network 
applications from scratch. Stem enables both groups 
of administrators equally. Non-programmers can cre- 
ate Stem configurations or modify the supplied exam- 
ples without any coding needed to solve their specific 
problems, simple or complex. Administrators who 
know Perl can create modules for functions not pro- 
vided by existing Stem modules and still take advan- 
tage of the infrastructure and network services that 
Stem offers, allowing them to limit their programming 
to just their specific task. The advantages of this 
approach, with respect to simplifying and speeding up 
application creation, are obvious. A site can even split 
up the work, with a developer creating new Stem 
modules, and an administrator creating the configura- 
tion to drive and deploy them. 


Stem Networking 


Gluing together disparate existing applications is 
a common and difficult problem when attempting to 
automate system administration tasks. Applications 
can have command line interfaces (CLI), be 
client/server based, or use common protocols (HTTP, 
SMTP, etc.). Stem enables an elegant solution to this 
task. Stem can be used to wrap each existing applica- 
tion in a module along with a new, message-based 
interface (essentially, an API). However, there is no 
need to predefine message queues or compile parame- 
ters as you do in many other message-passing or RPC 
type systems. Once this is done, Stem declarations 
invoke the module under conditions you specify. 


This approach allows the task of gluing together 
applications to be divided into two phases: first, 
creating a wrapper with a message-passing interface and 
second, using the new module within Stem, allowing 
the cells to communicate automatically with one 
another. The first task is performed only once, and the 
resulting module is reusable and generic, while the 
second task allows the module to be quickly put to 
work on a specific task. 


Related Work 


Stem is a unique integration of several kinds of 
technology, but has its roots in several other tools that 
contain part, but not all, of its capabilities. 


Message-passing Systems 


First, Stem is a “message-passing system,” but 
applying that name implicitly limits it more than is 
appropriate. For a programmer, the term 
“message-passing” generally refers to parallel programming 
libraries such as PVM and MPI [mpi]. Stem differs 
from such facilities in that it allows network 
applications to be created at a conceptually “higher” level and 
handles all of the lowest level message-passing 
functionality transparently to the application creator. Also, 
Stem’s messages can be sent to any cell in the current 
application, whether in the same hub (process), system 
or to a remote site. In fact, changing where a message is 
sent usually amounts to simply editing an address in a 
configuration file with no coding involved. 


Another common use of the term “message-passing” 
is to refer to a class of commercial products 
commonly called Message Oriented Middleware (MOM). 
These products include guaranteed message delivery 
among their capabilities, which also typically include 
access to databases and transaction systems and other 
related services. Such products are usually targeted to 
the financial and business community and are rarely 
used by administrators and network developers. A few 
of the more well known MOM products include IBM’s 
WebSphere MQ Family and Microsoft’s MSMQ [mom]. 


Stem differs from MOM applications in several 
ways. First of all, MOM systems are designed for use 
by professional application programmers and usually 
require substantial programming expertise to use. In 
contrast, Stem is designed for use by administrators 
for administrative tasks while minimizing required 
programming. Secondly, traditional MOM systems are 
extremely large and entail considerable overhead, 
from both a computing and a staffing point of view. 
MOM systems typically need a database system to 
function, and some MOM vendors (e.g., IBM) recom- 
mend at least one full time staff person dedicated to 
running them. Stem is at the other extreme in that it is 
extremely lightweight. Finally, while Stem is Open 
Source, existing MOM applications are commercial 
products which are both expensive and proprietary. 


Administrative Tools 


Second, Stem is intended as an “administrative 
tool” but has almost nothing in common with existing 
administrative tools such as Cfengine [cfengine]. 
These tools are intended primarily to control the 
configuration of hosts within a network. Stem is 
instead intended to allow flexible interoperation 
between tools within a network. While Cfengine pro- 
vides network communications layers so that hosts can 
exchange information, this information is limited to 
facts relative to host configuration. It cannot, for 
example, hand off a service request from one host to 
another, an operation that is trivial in Stem. 


The Swatch package [swatch] overlaps to some 
extent with one of Stem’s modules. This package 
monitors log file contents and searches for specified 
patterns set in its configuration file. The Stem::Log- 
Tail module performs a very similar function. 


Monitoring Tools 


Stem is also not a monitoring tool. It is better to 
say that Stem is a framework that makes it easier than 
ever to create applications which collect any desired 
system and network data. Thus, it can subsume many 
of the data collection capabilities of these tools. Stem 
could also be used to feed data to existing monitoring 
tools like RRDTool [rrdtool] or Cricket [cricket], 
enabling the administrator to extend their capabilities 
while continuing to take advantage of these tools’
mature visualization capabilities. 


Other Facets 


Stem has something in common with many other 
tools and approaches. Stem’s ability to function as an 
application wrapper has its roots in many other message-passing
and “screen-scraping” systems. Its basic
philosophy of creating a distributed configuration 
engine that responds to flexible events was first docu- 
mented in Distr [distr], though this mechanism was 
intended solely for file distribution. 


Stem Architecture 


To encompass so many facets of other work, 
Stem has evolved a unique architecture based upon a 
biological metaphor of “Stem Cells.” A Cell is the 
fundamental building block for a network application. 
In a running Stem application, one or more cells exist 
as objects in a Stem “hub”; a hub corresponds to a 
single daemon process. Multiple hubs can run simulta- 
neously, and they communicate with one another via 
constructs called “portals” that use TCP/IP sockets. 
Hubs can communicate with other hubs running on the 
same system or any network-accessible host, with 
Stem handling all of the interprocess communication. 


Stem is a fully event driven system. Events can 
be any of the common network operations such as 
socket connections, I/O on character devices (termi- 
nals, sockets, pipes, etc.), and timers. Stem uses a con- 
sistent technique for the delivery of messages based 
on all kinds of events. 


Cells


Stem cells are addressable objects which can send
and receive messages. Cells are registered at creation
time by name. There are three kinds of Stem Cells:

• Class Cells correspond to a Stem hub-wide
(process) resource. These Cells are usually self-registering
and generally use the class name as
their identifier, but more intuitive aliases may
also be defined.

• Object Cells are application-global objects.
Most often, they are created and registered by 
the configuration which the Stem system is run- 
ning, but they can also be loaded or created at 
runtime. They are generally long lived and last 
as long as the Stem hub is running. 

• Cloned Cells are additional instances copied
from an existing parent object cell. They share 
the parent’s Cell name but are additionally 
given a unique target name which assigns them 
a unique address. Cloned cells are similar to 
what other systems would call sessions. They 
are dynamically created upon request and last 
only as long as needed. 


The heart of Stem is the messaging subsystem 
and the heart of that is the registry. This is where all 
knowledge of how to address cells is located. Each 
cell is registered by its name and if it is a cloned cell, 
also by its target name, and messages are directed to it 
via these names. 
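
The registry's role can be pictured as a lookup table keyed by Cell name and optional target name. The Python sketch below only illustrates that idea; it is not Stem's Perl implementation, and the names `Registry`, `register`, `unregister`, and `lookup` are hypothetical.

```python
class Registry:
    """Toy lookup table mapping (cell name, target name) to a cell object."""

    def __init__(self):
        self._cells = {}

    def register(self, cell, name, target=None):
        # Cloned cells share the parent's name and add a unique target name.
        key = (name, target)
        if key in self._cells:
            raise KeyError(f"address already registered: {key}")
        self._cells[key] = cell

    def unregister(self, name, target=None):
        # Short-lived clones are removed when they are no longer needed.
        self._cells.pop((name, target), None)

    def lookup(self, name, target=None):
        # Returns None when no cell is registered under this address.
        return self._cells.get((name, target))


registry = Registry()
registry.register("parent proc cell", "mon")
registry.register("clone for session 1", "mon", target="1")
```

Looking up `("mon", None)` finds the parent object cell, while `("mon", "1")` finds the clone, which mirrors how a cloned cell shares its parent's name but has its own address.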


Messages 


Stem Messages are how Cells communicate with 
each other. Messages are simple data structures with 
two major sections, the address and the content. The 
address contains the Cell name that the message is 
directed to and which Cell sent it. The content has the 
message type, command and data. 


Message addresses are name triplets of Hub/ 
Cell/Target. The Cell name is required and the Hub 
and Target are optional. These triplets form globally 
unique addresses within the overall Stem system.


The Message address section has multiple 
address fields. The two primary fields correspond to 
the common email headers and are called ‘to’ and 
‘from’. The ‘to’ address designates which Cell will get 
this message, and the ‘from’ address says which Cell 
sent this message. 


The Message content has information about this 
message and any data being sent to the destination 
Cell. The primary attribute is ‘type’ which can be set 
to any string, but common types are data, cmd (com- 
mand), response and status. Stem modules and Cells 
can create any Message types they want. The other 
major attribute of the content is the data, which holds 
a reference to the message data. 
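
The message layout described above (an address section with 'to' and 'from' triplets, and a content section with type, command, and data) might be modeled as follows. This is a minimal Python sketch for illustration only: Stem messages are Perl objects, and the class names, field names, and the `reply` helper are assumptions rather than Stem's API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Address:
    cell: str                      # the Cell name is required
    hub: Optional[str] = None      # the Hub name is optional
    target: Optional[str] = None   # the Target name is optional


@dataclass
class Message:
    to: Address                    # which Cell will get this message
    from_: Address                 # which Cell sent it ('from' is a Python keyword)
    type: str = "data"             # e.g. data, cmd, response, status
    cmd: Optional[str] = None
    data: Any = None

    def reply(self, msg_type="response", data=None):
        # Hypothetical helper: a reply swaps the 'to' and 'from' addresses.
        return Message(to=self.from_, from_=self.to, type=msg_type, data=data)


request = Message(to=Address("mon"), from_=Address("A", target="7"),
                  type="cmd", cmd="cell_trigger")
answer = request.reply(data="11:42pm up 19 days")
```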


Modules 


The various Cell classes are implemented as Perl
modules, and many useful Cell types are included with 
the Stem package. These are among the most important: 

• Stem::SockMsg: Cells that connect to or accept
connections from sockets. These cells function
as a socket-to-Stem message gateway.

• Stem::Switch: Multiplexes Stem messages to
multiple destinations according to maps which
can be dynamically modified.

• Stem::Portal: Manages connections between
Stem hubs, facilitating message transmission
across the network (including authentication
and security functions).

• Stem::TtyMsg: Provides a TTY interface to a
Stem hub.

• Stem::Proc: Creates Cells that fork external processes
and manage them.

• Stem::Log: Writes and manages Stem logs
(which may be associated with external files).

• Stem::Log::Tail: Monitors active external files
(typically log files), sending newly acquired
data into the Stem logging subsystem either
periodically or on demand.

• Stem::Cron: Creates and manages scheduled
message submissions. Such messages can be
sent anywhere in a Stem network and can trigger
any Stem operation. If the message is
addressed to a Stem::Switch Cell, it can then be
sent to multiple destinations and trigger events
across the network from a centralized schedule.
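
The multiplexing behavior attributed to Stem::Switch (one incoming message fanned out to several destinations according to maps that can change at runtime) can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the module's interface: the class `Switch` and its methods `add`, `remove`, and `route` are hypothetical names.

```python
class Switch:
    """Toy fan-out table: a map name selects a list of destination addresses."""

    def __init__(self):
        self.maps = {}

    def add(self, map_name, destination):
        # Maps can be extended while the switch is running.
        self.maps.setdefault(map_name, []).append(destination)

    def remove(self, map_name, destination):
        self.maps[map_name].remove(destination)

    def route(self, map_name, message):
        # One copy of the message is produced per destination in the map.
        return [(dest, message) for dest in self.maps.get(map_name, [])]


switch = Switch()
switch.add("nightly", "hostA/mon")
switch.add("nightly", "hostB/mon")
deliveries = switch.route("nightly", "cell_trigger")
```

A scheduled message addressed to such a switch would therefore reach every destination currently listed in the selected map.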


In addition, Stem includes other utility modules 
which provide services to active Cells. These services 
include asynchronous I/O, cloning of cells, flow control
of local and remote method calls, and logical
pipes.

Configurations 

Stem differs from all other networking toolkits 
by being architected around configuration rather than 
software. A configuration file instructs the Stem 
engine which Stem cells to construct and register. 
When you invoke Stem, it interprets this configuration 
and creates cells as needed without any need to com- 
pile networking code. 


Configuration files are structured using Perl 
objects. Each desired cell is specified as a Perl object 
with a list of attributes. Several examples are dis- 
cussed in the next section. 
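
To make the idea of configuration-driven construction concrete, here is a minimal Python sketch of a loader that walks a list of cell descriptions and instantiates and registers one object per entry. It is only an analogy: real Stem configurations are Perl data structures interpreted by the Stem engine, and the classes `Cell`, `TtyMsg`, `Proc`, `SockMsg` and the function `load` below are hypothetical stand-ins.

```python
class Cell:
    def __init__(self, name=None, **args):
        self.name = name
        self.args = args


class TtyMsg(Cell):
    pass


class Proc(Cell):
    pass


class SockMsg(Cell):
    pass


def load(config, classes):
    """Construct and register one cell per configuration entry."""
    registry = {}
    for entry in config:
        cls = classes[entry["class"]]
        cell = cls(name=entry.get("name"), **entry.get("args", {}))
        # Assumption: an unnamed cell is registered under its class name.
        registry[cell.name or entry["class"]] = cell
    return registry


CLASSES = {"Stem::TtyMsg": TtyMsg, "Stem::Proc": Proc, "Stem::SockMsg": SockMsg}

CONFIG = [
    {"class": "Stem::TtyMsg", "args": {}},
    {"class": "Stem::Proc", "name": "mon", "args": {"path": "/usr/bin/uptime"}},
    {"class": "Stem::SockMsg", "name": "A", "args": {"port": 6666, "server": 1}},
]

cells = load(CONFIG, CLASSES)
```

The point of the design is that changing the application means editing `CONFIG`, not the loader or the cell classes.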


Example Application: An inetd-like Server 


We will consider a few versions of a small Stem 
application, chosen for its use of typical Stem compo- 
nents as well as in response to the space limitations 
imposed on this paper. None of them require any pro- 
gramming on the part of the system administrator. 


A Simple First Version 


This version of the application serves as a good 
starting point for understanding Stem. It uses only 
existing Stem modules to create an application which 
can execute a process on a local or remote system. As 
such, the only task required to create the application is 
to set up a file containing the Stem configuration. The 
simplest version is given in Listing 1.


This configuration uses three cells:

• A Stem::TtyMsg Cell. This Cell is used to enable
commands for the hub to be typed in at the keyboard.
In this configuration the default attributes
are used.

• A Stem::Proc Cell named “mon”. This section
of the configuration file will create a Cell that
starts a process on demand. Here we specify a
list of arguments for the Cell: the path to the
command to be run, and some generic Cell
attributes. The latter specify that the cell is to
be cloned and that it will only send the data
from its process when it exits.

• A Stem::SockMsg Cell named ‘A’. This Cell
will listen on a socket (the port address is specified
by the port attribute). When a connection
request is accepted, it will create a logical pipe
to the ‘mon’ cell as specified in the ‘pipe_addr’
attribute.


# uptime.stem

[
    class => 'Stem::TtyMsg',
    args  => [],
],
[
    class => 'Stem::Proc',
    name  => 'mon',
    args  => [
        path      => '/usr/bin/uptime',
        cell_attr => [
            'data_addr'          => 'A',
            'send_data_on_close' => 1,
        ],
    ],
],
[
    class => 'Stem::SockMsg',
    name  => 'A',
    args  => [
        port      => 6666,
        server    => 1,
        cell_attr => [
            'data_addr' => 'A',
        ],
    ],
],


Listing 1: Simplest configuration file.


[Figure 1 content: the “Monitor” telnet window showing two uptime output lines, at 11:42pm and 11:43pm.]


In this example when a socket connection is
made, the logical pipe to ‘mon’ is created, which causes
the ‘mon’ cell to clone and fork the uptime program. Its
output is collected and then sent back to the ‘A’ cell
and then on to the socket. Stem will take care of creating
and sending all of those messages automatically.


Running this configuration requires the follow- 
ing commands: 


$ xterm -T Stem -n Stem \ 
-e run_stem uptime 
$ xterm -T Monitor -n Monitor \ 
-e telnet localhost 6666 


We use two xterm commands to make Stem’s 
operations visible. The first command starts the Stem 
hub daemon process and attaches the Stem::TtyMsg 
Cell to it so commands can be entered. The second 
command attaches a second window to port 6666, the 
Stem::SockMsg Cell, using a telnet command. Figure 1
illustrates the resulting windows.


In the “Stem” window, we send the cell_trigger 
command to the mon Cell. This causes the Stem::Proc 
Cell to execute its command. Note that the output 
appears in the “Monitor” window attached to port
6666 each time the cell is triggered. 


A Piped Version 


The problem with the example above is that you 
must enter a ‘cell_trigger’ command to make the pro- 
cess execute and only one telnet session can be used. 
This new version (see Listing 2) will make the process 
execute when the telnet connects to the socket. Note it 
uses the ‘pipe_addr’ attribute which will create a logi- 
cal pipe between the cell ‘A’ and the ‘mon’ cell. Actu- 
ally the ‘mon’ cell will be cloned when a pipe to it is 
created and that cloned cell will run the process. 


To run this example, name the configuration file
uptime2.stem and run this command:


$ xterm -T Stem -n Stem \ 
-e run_stem uptime2 


Then, in another window run the telnet command: 
$ telnet localhost 6666 


Each time you run telnet you will invoke the uptime 
command and see its output from telnet which will 
then exit. 





Figure 1: A simple Stem application.


# uptime2.stem

[
    class => 'Stem::TtyMsg',
    args  => [],
],
[
    class => 'Stem::Proc',
    name  => 'mon',
    args  => [
        path      => '/usr/bin/uptime',
        cell_attr => [
            'cloneable'          => 1,
            'send_data_on_close' => 1,
        ],
    ],
],
[
    class => 'Stem::SockMsg',
    name  => 'A',
    args  => [
        port      => 6666,
        server    => 1,
        cell_attr => [
            pipe_addr => 'mon',
        ],
    ],
],


Listing 2: Improved with concurrent sockets.


A Multi-Hub Version


The previous two Stem applications were
extremely simple, but their potential functionality is
very powerful. They can be extended in several ways:

• The process can be executed on a different host
than the triggering host.

• The process can be executed on multiple
remote systems.

• The output can be sent to more than one destination:
multiple sockets on different systems, a
log file, another Stem cell for further processing,
and so on.

• A different command or program can be run.
The uptime command merely serves as a proof-of-concept
here. Any command that is needed
could be executed.

• More than one command can be supported by
defining multiple Stem::Proc cells. In this way,
the application can function in a similar way to
inetd in that it can start any one of a number of
preconfigured servers upon demand.

• The process can be triggered other ways than
by manually entering a command or by connecting
to a socket: by a different Stem cell,
according to a schedule (using Stem::Cron), etc.
The triggering can come from the server hub,
the client hub, or elsewhere in the Stem system.


Listings 3 and 4 illustrate a multi-hub version of
this application. Notice how easy it is to split the simple
version into an implementation which can be run
across the network.


To run this application, start processes like these
(as before):
$ xterm -T Server -n Server \ 
-e run_stem uptime_server 
$ xterm -T Client -n Client \ 
-e run_stem uptime_client 


Then, in another window run this command: 
$ telnet localhost 6666


This will behave the same as the uptime2 example but 
it is split over two hubs. Notice that other than adding 
the Hub and Portal cells, the only change to the con- 
figuration was adding a hub name to the ‘pipe_addr’ 
attribute in the uptime_client configuration. This illus- 
trates how easy it is to distribute applications written 
in Stem across a network. Sending messages to local 
or remote cells is done the same way — typically only 
the address will need to be changed. 
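
The claim that local and remote delivery differ only in the address can be illustrated with a toy model. In the Python sketch below, a hub delivers a message itself when the address names no hub (or names this hub) and otherwise hands it to a portal. Everything here is a hypothetical stand-in: the `Hub` class, its `connect` and `send` methods, and the direct object reference that takes the place of a TCP portal connection.

```python
class Hub:
    def __init__(self, name):
        self.name = name
        self.cells = {}     # cell name -> handler function
        self.portals = {}   # remote hub name -> Hub (stands in for a TCP portal)

    def connect(self, other):
        # A portal links two hubs in both directions.
        self.portals[other.name] = other
        other.portals[self.name] = self

    def send(self, to_cell, data, to_hub=None):
        # Same call for local and remote cells: only the address differs.
        if to_hub is None or to_hub == self.name:
            return self.cells[to_cell](data)
        return self.portals[to_hub].send(to_cell, data, to_hub=to_hub)


server = Hub("uptime_server")
client = Hub("uptime_client")
client.connect(server)
server.cells["mon"] = lambda data: f"server ran: {data}"
client.cells["mon"] = lambda data: f"client ran: {data}"

local = client.send("mon", "uptime")
remote = client.send("mon", "uptime", to_hub="uptime_server")
```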


This example application and its variations pro- 
vide some insight into Stem’s flexibility and capabili- 
ties. The Stem distribution comes with several other 
example applications and a cookbook that shows you 
how to create your own cells. 


# uptime_server.stem

## Name the hub so the client can
## refer to it.

[
    class => 'Stem::Hub',
    name  => 'uptime_server',
    args  => [],
],
[
    class => 'Stem::TtyMsg',
    args  => [],
],

## Set the portal to be a server:
## listen for portal connections from
## any host the port defaults to 10000
## but can be set here

[
    class => 'Stem::Portal',
    name  => 'listener',
    args  => [
        'server' => 1,
        'host'   => '',
    ],
],
[
    class => 'Stem::Proc',
    name  => 'mon',
    args  => [
        path      => '/usr/bin/uptime',
        cell_attr => [
            'cloneable'          => 1,
            'send_data_on_close' => 1,
        ],
    ],
],


Listing 3: Multi-hub server.


Critique and Analysis


The major infrastructure work of creating Stem 
has been completed, and the package is working well 
where it has been deployed. The design has proven to 
be as flexible as was intended, and Stem applications 
have been created by administrators unfamiliar with 
the package after just a few hours. 


Stem’s planned extensibility has also been veri- 
fied in that administrators have successfully written 
additional Stem modules in Perl and integrated them 
with those that the package provides. 


Stem’s highly modular design has proven to
be instrumental in extending it. New modules can eas- 
ily be created and integrated. Internal services have 
been developed and quickly used by other modules 
and cells. Its simple message-passing API allows 
almost any external application, service or protocol to 
interact with any other. 


# uptime_client.stem

## Name the hub so server can refer to it.

[
    class => 'Stem::Hub',
    name  => 'uptime_client',
    args  => [],
],
[
    class => 'Stem::TtyMsg',
    args  => [],
],

## Create a client portal.
## this will connect to the portal
## in the 'monitoring' hub the default
## host is 'localhost' but it can be
## set here the port defaults to
## 10000 but can be set here

[
    class => 'Stem::Portal',
    name  => 'server',
    args  => [],
],
[
    class => 'Stem::SockMsg',
    name  => 'A',
    args  => [
        port      => 6666,
        server    => 1,
        cell_attr => [
            cloneable => 1,
            pipe_addr => ...,
        ],
    ],
],


Listing 4: Multi-hub client.


Stem’s configuration file format currently takes 
the form of Perl data structures. This has performance 
implications and security issues as well as presenting a 
somewhat eccentric interface to non-Perl-literate
system administrators. In the future, we plan to support
multiple configuration formats, including XML.


Similarly, the format of Stem’s messages (when
serialized over a pipe) also needs to support other formats.
As with the configuration formats, this is just a
matter of having modules that can convert the internal
message structure to and from an external format. This
will allow other systems to be more easily integrated, as
they can then send and receive Stem messages.


Stem’s security support currently is weak. We 
have demonstrated to ourselves that we can use ssh for 
message-passing between Stem processes but the 
design was not good enough for production. We have 
plans to redesign it to be integrated with Stem’s socket 
module so that any IPC (not just message-passing) can 
use it. Also the design would allow a choice of secure 
transport (ssh, SSL etc.) by using the same modular 
plug-in design as mentioned above. 


Future Work 


Stem is extremely modular in design and can be 
extended easily in many directions. Here is a short list 
of some items that are in our development queue now: 

• State Machine: A text-based state machine that
will take input from multiple sources: socket, 
processes and messages. It will have many fea- 
tures including state callbacks, input and output 
buffers, regular expression matching, and the 
like. 

• Flow Control: A module that will allow a Stem
Cell to control the logic flow of method calls, 
regardless of whether they are local or remote. 
It manages a combination of synchronous 
(local) and asynchronous (remote via mes- 
sages) object method calls in a simple mini-lan- 
guage that will have the common flow control 
operations such as IF/ELSE, WHILE, etc. This 
greatly simplifies the task of coordinating dis- 
tributed operations upon an object (a Stem 
Cell) such as accessing a database or using sub- 
processes and remote protocols. 

• Network Protocols: When the State Machine is
finished, it will be used in various protocol 
modules which will enable Stem to communi- 
cate with programs which use the popular pro- 
tocols such as HTTP, FTP, SMTP, etc. 

• GUI-based Configuration Tool: A longer term
goal is to integrate Stem with a GUI toolkit 
such as Tk or Qt in order to develop a tool for 
creating and visualizing Stem configurations. 
Doing so will also make it possible to create a 
wide range of GUI front ends for Stem based 
applications. 


Availability 


Stem is Open Source software licensed under
the GPL and available without charge from Stem
Systems. Our website is http://www.stemsystems.com.


ABSTRACT


The computational requirements for the new Large Hadron Collider are enormous: 5-8 
PetaBytes of data generated annually with analysis requiring 10 more PetaBytes of disk storage 
and the equivalent of 200,000 of today’s fastest PC processors. This will be a very large and 
complex computing system, with about two thirds of the computing capacity installed in “regional 
computing centres” across Europe, America, and Asia. 


Implemented as a global computational grid, the goal of integrating the large geographically 
distributed computing fabrics presents challenges in many areas, including: distributed scientific 
applications; computational grid middleware, automated computer system management; high 
performance networking; object database management; security; global grid operations. 


This paper describes our approach to one of these challenges: the configuration management 
of a large number of machines, be they nodes in large clusters or desktops in large organizations. 


Introduction 


The European Organization for Nuclear Research 
(CERN) is building the Large Hadron Collider (LHC), 
the world’s most powerful particle accelerator. From the 
LHC Computing Grid Project home page (http:// 
cern.ch/LHCgrid): 

The computational requirements of the experi- 
ments that will use the LHC are enormous: 5-8 
PetaBytes of data will be generated each year, the 
analysis of which will require some 10 PetaBytes 
of disk storage and the equivalent of 200,000 of 
todays fastest PC processors. Even allowing for 
the continuing increase in storage densities and 
processor performance this will be a very large 
and complex computing system, and about two 
thirds of the computing capacity will be installed 
in “regional computing centres” spread across 
Europe, America and Asia. 

The computing facility for LHC will thus be imple- 
mented as a global computational grid [10], with 
the goal of integrating large geographically dis- 
tributed computing fabrics into a virtual comput- 
ing environment. There are challenging problems 
to be tackled in many areas, including: distributed 
scientific applications; computational grid middle- 
ware, automated computer system management; 
high performance networking; object database 
management; security; global grid operations. 


This paper describes our approach to one of these 
challenges: the configuration management of a large 
number of machines, be they nodes in large clusters or 
desktops in large organizations. 


Large Scale System Administration 


Many solutions for managing a few machines do 
not scale well. When dealing with thousands of 
machines, some problems start to overwhelm.


Automation


It is fine to reinstall your home PC by booting 
from an installation diskette and typing a few com- 
mands but you certainly do not want to do it on a large 
cluster. Similarly, it is acceptable to log into one 
machine and purge /tmp by hand but this could also be 
automated using a tool like Red Hat’s tmpwatch. 


Technical solutions exist in many domains (remote 
power control, console concentration, network booting, 
unattended system installation, package management, 
etc.) so manual interventions can and should be limited 
to the absolute minimum. What remains to be done by 
hand is to configure the programs that will automate 
these otherwise manual tasks. 


This approach has an interesting side effect. Sys- 
tem configurations are known to rot with time because 
ad hoc system interventions tend to accumulate small 
mistakes until the system malfunctions. An unattended
but complete reinstallation from scratch (not a restore 
from backup) is the most cost effective way to get rid of 
the problem. All you need is to make sure that the con- 
figuration of the installer is kept up to date, which is not 
difficult to do. 


Abstraction 


Once we have reached the full automation of the 
installation process, we can define the configuration of a 
machine as the sum of all the configurations of all the 
programs used either during the installation or after- 
wards. This will contain everything from disk partition- 
ing information to system or software configuration 
(networking, user accounts, X server, etc.). 


You should not include all the things that can be
configured but rather the ones that will be configured.
For instance, if you do not need to change the
/usr/share/magic file used by the file command, consider
it as a (static) data file that comes with the file package
itself and therefore outside of the machine configuration
abstraction.


We then want to reason about these machine configurations
and, for instance, express the fact that two
machines have the same disk model or the fact that one
thousand machines belong to the same batch cluster. It
is tedious to use the native configuration files, e.g.,
/etc/services to describe the known network services or
/etc/crontab for cron’s configuration, because of the
duplication of information and the variety of convoluted
formats. We really need an abstraction of this information
that is easy to use.


The way the configuration information is 
abstracted and represented is very important. This is in 
line with Eric S. Raymond’s advice in The Cathedral 
and the Bazaar [16]: “Smart data structures and dumb 
code works a lot better than the other way around.” 


It is quite expensive to come to a good abstrac- 
tion but it really pays off when managing many 
machines. It is a virtuous circle: the more data you put 
in the abstraction, the more useful it becomes. 


Single Database for Multiple Tools 


In theory, a unique system administration tool is 
better than a set of unrelated and often overlapping 
tools such as AutoRPM, cfengine, RDist, LCFG, etc. 
In practice, such a mythical beast does not exist and
system administrators use the combination of tools
best suited to their needs, often with a pinch of
homemade scripts written in “glue languages” such as Perl
or Tcl.


It is good to combine the strengths of these tools, 
but the variety of their configuration formats is a big 
disadvantage. Information is duplicated and often 
cumbersome to maintain. Until recently, the machines 
in our computer centre were drawing on information 
from more than twenty different sources, from flat 
files to real databases. Mistakes when handling these 
files (e.g., adding a machine and forgetting to update 
one file) were a common source of problems. 


A good approach is to use a single source of 
information (a central configuration database) and 
simple programs that can transform this information 
into the format understood by the tools used. 
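
As an illustration of this approach, the Python sketch below derives two tool-specific outputs (lines for a hosts file and host declarations in ISC dhcpd.conf style) from one shared data structure. It is not the system described in this paper: the `INVENTORY` dictionary, its field names, the addresses, and both generator functions are invented for the example.

```python
# Single source of information (hypothetical sample data).
INVENTORY = {
    "node001": {"ip": "192.0.2.11", "mac": "00:00:5e:00:53:01"},
    "node002": {"ip": "192.0.2.12", "mac": "00:00:5e:00:53:02"},
}


def hosts_lines(machines):
    """Render entries in the format of a hosts file."""
    return [f"{m['ip']}\t{name}" for name, m in sorted(machines.items())]


def dhcp_entries(machines):
    """Render host declarations in the style of ISC dhcpd.conf."""
    return [
        f"host {name} {{ hardware ethernet {m['mac']}; fixed-address {m['ip']}; }}"
        for name, m in sorted(machines.items())
    ]
```

Adding a machine then means adding one entry to the shared data; every generated file picks it up, so no file can be forgotten.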


Change Management 


In the first eight months of its life, Red Hat 
Linux 7.2 had 311 updated RPMs. This is more than 
one updated RPM per day on average. Just looking at 
security, Red Hat issued 59 security advisories for the 
same system during the same period, almost two per 
week on average. In large computing centres, changes 
will occur frequently so you must manage them ade- 
quately. 
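
These averages can be checked with a short calculation. The exact dates are not given in the text, so the sketch assumes that eight months is roughly 243 days (an average month of 30.4 days).

```python
# Assumption: eight months at an average of 30.4 days per month.
days = 8 * 30.4
weeks = days / 7

rpms_per_day = 311 / days          # updated RPMs per day
advisories_per_week = 59 / weeks   # security advisories per week

print(f"{rpms_per_day:.2f} updated RPMs per day")
print(f"{advisories_per_week:.2f} security advisories per week")
```

Under that assumption the results are about 1.3 updated RPMs per day and about 1.7 advisories per week, consistent with the figures quoted above.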


Every component (be it hardware or software) 
has a non-zero failure probability so, statistically, large 
computer centres have a high probability of having, at 
any point in time, one or more components not
working properly. This is especially true when using
cheap commodity hardware. Some machines will 
always be down or somehow unreachable, so configu- 
ration changes have to be deployed asynchronously. 


Moreover, some critical processes cannot be 
interrupted and intrusive system management tasks 
(such as changing the kernel in use) have to be 
deferred until the system is ready to accept the 
changes. 


The consequence is that you should not try to 
configure a machine directly but rather change its con- 
figuration (stored outside of the machine) and let the 
machine bring itself into line whenever it can. This 
can be called convergent or asymptotic configuration 
(see for instance [18]): the machines independently try 
to come closer to their “desired state.”
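
A minimal sketch of the convergent idea, in Python: each machine compares its actual state with the desired state held outside the machine, applies what it can, and leaves the rest for a later pass. The function `converge`, the `deferred` parameter, and the sample parameter names and values are all hypothetical; they do not describe the tools discussed in this paper.

```python
def converge(desired, actual, deferred=frozenset()):
    """Apply pending changes in place and return the keys that must wait."""
    pending = []
    for key, value in desired.items():
        if actual.get(key) == value:
            continue                  # already in the desired state
        if key in deferred:
            pending.append(key)       # intrusive change, machine not ready
        else:
            actual[key] = value       # bring this parameter into line
    return pending


desired = {"kernel": "2.4.18", "ntp_server": "ntp.example.org"}
actual = {"kernel": "2.4.9", "ntp_server": "old.example.org"}

# First pass: a critical job is running, so the kernel change is deferred.
first = converge(desired, actual, deferred={"kernel"})
# Later pass: nothing is deferred any more, so the machine catches up.
second = converge(desired, actual)
```

Because each pass only moves the machine closer to the desired state, running it repeatedly is safe, and a machine that was down simply converges when it comes back.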


Validation 


We should keep in mind this slight modification 
of one of Murphy’s laws: anything that can go wrong 
will go wrong more spectacularly with central system 
administration. 


Having a central database holding machine con- 
figurations and letting thousands of programs on 
remote machines use it is very powerful but mistakes 
can have disastrous consequences. It is of paramount 
importance to control the changes and detect mistakes 
before it is too late. Advanced means of validating the 
stored information must be in place. 


The good news is that the abstraction mentioned 
above really helps. Once you can reason about 
machines and their configuration parameters, it is easy 
to express constraints such as “for all the machines, 
the filesystems mounted through NFS must be 
exported by the corresponding server.” The example 
given in Appendix C describes exactly this constraint 
in our Pan language. 


Validation should not only be seen as a way to 
prevent mistakes, it can also be used to make sure that 
things are really the way you want them to be. Con- 
straints can be used, for example, to ensure that all 
machines have enough swap space, or that they are 
running the correct version of some software. 
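
The two kinds of constraint mentioned here can be sketched over a simple in-memory model. The Python below is purely illustrative and is not Pan (the Pan formulation is the one given in Appendix C); the `FABRIC` data, its field names, and the functions `nfs_violations` and `low_swap` are assumptions made for the example.

```python
def nfs_violations(machines):
    """Return (client, server, path) for NFS mounts the server does not export."""
    violations = []
    for name, cfg in sorted(machines.items()):
        for mount in cfg.get("nfs_mounts", []):
            server = machines.get(mount["server"], {})
            if mount["path"] not in server.get("exports", []):
                violations.append((name, mount["server"], mount["path"]))
    return violations


def low_swap(machines, minimum_mb):
    """Return the names of machines with less swap than required."""
    return sorted(n for n, cfg in machines.items()
                  if cfg.get("swap_mb", 0) < minimum_mb)


# Hypothetical sample data.
FABRIC = {
    "fileserver": {"exports": ["/home"], "swap_mb": 1024},
    "node001": {"swap_mb": 512,
                "nfs_mounts": [{"server": "fileserver", "path": "/home"}]},
    "node002": {"swap_mb": 128,
                "nfs_mounts": [{"server": "fileserver", "path": "/scratch"}]},
}
```

With this data, `nfs_violations(FABRIC)` flags the `/scratch` mount on `node002`, and `low_swap(FABRIC, 256)` reports `node002` as well. Such checks would run before a change is accepted into the central database.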


Our Solution 


Overview 


The Fabric Management Work Package (WP4 
[21]) of the European Union DataGrid Project (EDG
[8]) seeks to control large computing fabrics through 
the central management of their “desired state” via a 
central configuration database (one per administrative 
domain). This information will then be used in differ- 
ent ways. 


For the initial system installation (we currently 
use Red Hat Linux 7.2), it will be used to create the 
various files needed to fully automate this process, for 
instance DHCP entries and Kickstart files.


For the system maintenance (we currently use
LCFG), the configuration information will be directly
used by a number of modular “component” scripts
which are responsible for different subsystems, such as
“mail configuration” or “web server configuration.”
The components are notified when their configuration 
changes and are responsible for translating the abstract 
configuration into the appropriate configuration files, 
and reconfiguring any associated daemons. 


Machines will be self healing thanks to sensors 
reporting information to a monitoring database and 
actuators using this information to trigger recovery 
actions such as restarting a daemon or, in extreme 
cases, triggering a full reinstallation of the machine. 
Through the inclusion of hardware information in the 
configuration database, we can also detect problems 
such as a dead CPU or stolen memory. 


Configuration Database 


The configuration database [4] stores two forms 
of configuration information. One is called the High 
Level Description [5] and is expressed in the Pan lan- 
guage. The other is the Low Level Description [11] 
and is expressed in XML. Both are explained below. 


The system administrators can edit the High Level 
Description, either directly or through some scripting 
layer. The Low Level Description (one XML file per 
machine) is always generated using the Pan compiler. 


The XML machine configuration is cached on 
the machine (to support disconnected operations) and 
access is provided through a high-level library [15] 
that hides the details such as the XML schema used. 


The database itself includes a scalable distribu- 
tion mechanism for the XML files based on HTTP, 
and the possibility of adding any number of backends 
(such as LDAP or SQL) to support various query pat- 
terns on the information stored. It should scale to mil- 
lions of configuration parameters. 


Low Level Description 


Mapping a configuration abstraction to a tree 
structure is quite easy. This is the natural format for 
most “organized information,” from files in a filesystem
to the Windows registry or LDAP. We call this the
Low Level Description (LLD). 


Simple values (like strings or numbers) form the
leaves of this tree and are called properties. Internal
nodes of the tree are called resources and are used to
group elements¹ into lists (accessed by index) or
named lists (aka nlists, accessed by name). Nlists can
conveniently be used to represent tables or records².
Every element has a unique path which identifies its
position in the tree.

¹ The term element refers to either a property or a resource.
² I.e., similar to Pascal’s record or C’s struct.


We chose XML to represent this tree in a file
because it maps well to the hierarchical structure of
the information and it is easy to parse and to validate
(with XML Schema). Here is a small example representing
some hardware information (a larger example
can be found in Appendix A):


<?xml version="1.0" encoding="utf-8"?>
<nlist name="profile">
  <nlist name="hardware">
    <nlist name="memory">
      <long name="size">512</long>
    </nlist>
    <list name="cpus">
      <nlist>
        <string name="vendor">Intel</string>
        <string name="model">Pentium III (Coppermine)</string>
        <double name="speed">853.22</double>
      </nlist>
    </list>
  </nlist>
</nlist>


Starting with the toplevel XML element (named
“profile”), you can see on the fifth line the property
describing the memory size: its path is
/hardware/memory/size and its value is the long integer 512.
Similarly, /hardware/cpus is the resource representing the
list of CPUs. The path /hardware/cpus/0³ identifies the first
(and only) CPU, which is represented using an nlist that
holds a kind of record or structure describing the CPU.
The model of the CPU is the string at path
/hardware/cpus/0/model and its value is “Pentium III
(Coppermine).”
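As an illustration only (this is not the access library [15], whose API is not shown here), the path convention just described can be sketched in Python. The helper names `to_python` and `lookup` are hypothetical, and the sketch handles only the element kinds used in the example (nlist, list, long, double, string).

```python
import xml.etree.ElementTree as ET

# The small hardware example from the text (XML declaration omitted).
LLD_XML = """\
<nlist name="profile">
  <nlist name="hardware">
    <nlist name="memory">
      <long name="size">512</long>
    </nlist>
    <list name="cpus">
      <nlist>
        <string name="vendor">Intel</string>
        <string name="model">Pentium III (Coppermine)</string>
        <double name="speed">853.22</double>
      </nlist>
    </list>
  </nlist>
</nlist>
"""


def to_python(elem):
    """Convert an LLD element into plain Python data (dict, list or scalar)."""
    if elem.tag == "nlist":   # resource: named list, children accessed by name
        return {child.get("name"): to_python(child) for child in elem}
    if elem.tag == "list":    # resource: list, children accessed by index
        return [to_python(child) for child in elem]
    text = (elem.text or "").strip()
    if elem.tag == "long":
        return int(text)
    if elem.tag == "double":
        return float(text)
    return text               # "string" (other property types are not handled)


def lookup(tree, path):
    """Follow a path such as /hardware/cpus/0/model down the tree."""
    node = tree
    for step in path.strip("/").split("/"):
        if step == "":
            continue          # the path "/" designates the root itself
        node = node[int(step)] if isinstance(node, list) else node[step]
    return node


profile = to_python(ET.fromstring(LLD_XML))
print(lookup(profile, "/hardware/memory/size"))   # 512
print(lookup(profile, "/hardware/cpus/0/model"))  # Pentium III (Coppermine)
```

List items are addressed by their zero-based index and nlist items by name, which is why the same `lookup` loop works for both kinds of resource.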

The way this information appears in the XML 
file is dependent on the programs using it. For 
instance, if you have only a few X server configura- 
tions but a large variety of resolution settings, you 
could have something like: 

<nlist name="profile">
  <nlist name="system">
    <nlist name="x">
      <string name="XF86Config" type="fetch">
        http://config.cern.ch/XF86Config-ATI64-19
      </string>
      <list name="modes">
        <string>1280x1024</string>
        <string>1024x768</string>
      </list>
    </nlist>
  </nlist>
</nlist>


The special fetch type is known by our system
and the programs accessing the configuration information
through our API will simply see the contents of
the file at the given URL as a string. A program
responsible for managing the X configuration would
simply have to start with this base file, substitute in
the desired modes and write the result to
/etc/X11/XF86Config.


³ In paths, numbers are used to identify list items, the first
one having the index 0, like in C.


High Level Description 


Although the previous XML representation is 
sufficient for the programs running on the target 
machines, we need a High Level Description (HLD) to 
reason about groups of machines and share common 
information. 


Existing tools (such as m4) only cover some of 
our requirements so we decided to design our own lan- 
guage to represent the HLD and we wrote the accom- 
panying compiler transforming this HLD into LLD 
(i.e., XML). 


We believe (in line with Paul Anderson’s A
Declarative Approach to the Specification of Large-Scale
System Configurations [2]) that a declarative
approach⁴ to configuration specification is better
suited than a procedural one⁵. Pan has been designed
to stay as declarative as possible while allowing some
form of procedural code, which is required to take full
advantage of the power of validation.

The following is a quick overview of the salient
features of the Pan language. The complete language
specification is available in another document [5].


⁴ I.e., describe what things should look like in the end.
⁵ I.e., describe the sequence of actions to be performed.


# definition for the disk IBM DTLA-307030
structure template disk_ibm_dtla_307030;
  "type" = "disk";
  "vendor" = "IBM";
  "model" = "DTLA-307030";
  "size" = 29314; # MB


Overview


Pan mainly consists of assignments, each of
which sets some value in a given part of the LLD
identified by its path. The following code can be used
to generate the LLD shown earlier. The left hand side
of the assignment is the path and the right hand side is
the value. nlist is a builtin function that will return the
nlist made from its arguments.

"/hardware/memory/size" = 512;
"/hardware/cpus/0" = nlist(
  "vendor", "Intel",
  "model", "Pentium III (Coppermine)",
  "speed", 853.220,
);


Pan also features other statements like include 
(very similar to cpp’s #include directive) or delete that 
can delete a part of the LLD. 
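To make the effect of such assignments concrete, here is a minimal Python sketch that builds the same tree from (path, value) pairs. It is only an approximation under stated assumptions: the helpers `assign` and `_child` are hypothetical, intermediate resources are created on demand, and a numeric path step is taken to mean a list index. The real compiler's rules may differ.

```python
def _child(node, step, default):
    """Return node[step], creating it with `default` if it is missing."""
    if isinstance(node, list):
        index = int(step)
        while len(node) <= index:      # pad the list up to the wanted index
            node.append(None)
        if node[index] is None:
            node[index] = default
        return node[index]
    return node.setdefault(step, default)


def assign(tree, path, value):
    """Set `value` at `path`, creating intermediate nlists and lists."""
    steps = path.strip("/").split("/")
    node = tree
    for step, following in zip(steps, steps[1:]):
        # a numeric next step means that this child must be a list
        node = _child(node, step, [] if following.isdigit() else {})
    last = steps[-1]
    if isinstance(node, list):
        _child(node, last, None)       # make sure the slot exists
        node[int(last)] = value
    else:
        node[last] = value


lld = {}
assign(lld, "/hardware/memory/size", 512)
assign(lld, "/hardware/cpus/0", {
    "vendor": "Intel",
    "model": "Pentium III (Coppermine)",
    "speed": 853.220,
})
print(lld["hardware"]["cpus"][0]["vendor"])  # Intel
```

The dictionary passed for /hardware/cpus/0 plays the role of the value returned by nlist(...) in the Pan code above.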


The grouping of statements into templates allows 
the sharing of common information and provides a 
simple inheritance mechanism. A structure template is 
used to represent a subtree of information (for instance 
a given disk) while an object template represents a real 
world object (the compiler will generate a separate 
LLD for every object template encountered). 
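The division of labour between the kinds of template can be mimicked in Python as follows. This is a loose analogy rather than a description of the compiler: structure templates become reusable dictionaries, an ordinary template becomes a function applying shared statements, and each object template becomes a function producing one profile. The template names and values come from the paper's own examples; the Python function names are hypothetical.

```python
import copy

# Structure templates: reusable subtrees of information.
STRUCTURES = {
    "disk_ibm_dtla_307030": {
        "type": "disk", "vendor": "IBM",
        "model": "DTLA-307030", "size": 29314,
    },
    "disk_quantum_fireballp_as20_5": {
        "type": "disk", "vendor": "QUANTUM",
        "model": "FIREBALLP AS20.5", "size": 19596,
    },
}


def create(name):
    """Return a fresh copy of a structure template."""
    return copy.deepcopy(STRUCTURES[name])


def cluster_venus(lld):
    """Ordinary template: statements shared by all the cluster members."""
    lld["hardware"] = {
        "vendor": "Elonex",
        "devices": {"hda": create("disk_ibm_dtla_307030")},
    }


def venus001():
    """Object template: one real machine, hence one generated LLD."""
    lld = {}
    cluster_venus(lld)                     # include cluster_venus;
    lld["hardware"]["serial"] = "CHO1112041"
    return lld


def venus002():
    """Second machine: same cluster, but its first disk was replaced."""
    lld = {}
    cluster_venus(lld)                     # include cluster_venus;
    lld["hardware"]["serial"] = "CHO1117031"
    lld["hardware"]["devices"]["hda"] = create("disk_quantum_fireballp_as20_5")
    return lld


# One LLD per object template.
profiles = {"venus001": venus001(), "venus002": venus002()}
print(profiles["venus002"]["hardware"]["devices"]["hda"]["model"])
# FIREBALLP AS20.5
```

Because `create` copies the structure, overriding the disk of one machine leaves both the shared definition and the other machine untouched.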


Listing 1 shows a partial example of two cluster
nodes that share most of their configuration information.


# definition for the hardware Elonex 800x2/512
structure template pc_elonex_800x2_512;
  "vendor" = "Elonex";
  "model" = "800x2/512";
  "cpus" = list(create("cpu_intel_p3_800"), create("cpu_intel_p3_800"));
  "memory/size" = 512; # MB
  "devices/hda" = create("disk_ibm_dtla_307030");

# definition for the Venus cluster
template cluster_venus;
  "/hardware" = create("pc_elonex_800x2_512");
  # and any other hardware or system information shared by all the
  # members of the Venus cluster

# first machine
object template venus001;
  include cluster_venus;
  "/hardware/serial" = "CHO1112041";

# second machine
object template venus002;
  include cluster_venus;
  "/hardware/serial" = "CHO1117031";
  # the first disk has been replaced
  "/hardware/devices/hda" = create("disk_quantum_fireballp_as20_5");

Listing 1: Two cluster nodes which share most of their configuration.


Types


Pan contains a very flexible typing mechanism.
It has several builtin types (such as boolean, string, long,
...) and allows compound types to be built on top of
these. Once the type of a configuration element is
known, the compiler makes sure that only values of
the right type are assigned to it. By explicitly specifying
the type of the root element (i.e., the top of the
configuration tree), one can completely define the
schema of the information that is found in the LLD.
The type enforcement done by the compiler guarantees
that only LLDs conforming to the schema will be
generated. This enforcement is illustrated in Listing 2.


Starting with the root of the LLD, the compiler
will make sure that the data corresponds to the
declared type. Extra and missing fields in structures
trigger a compilation error. The code in Listing 2
ensures that /hardware/memory/size is always present
and contains a non-negative long integer (the range
0.. includes zero).
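The kind of check described here (required and optional fields, no extra fields, ranges on numbers) can be sketched in Python. This is a hand-written analogue for illustration, not how the Pan compiler is implemented; the function names `record`, `long_range`, `string` and `conforms` are hypothetical, and only a subset of the hardware structure is modelled.

```python
def long_range(lo=None, hi=None):
    """Checker for a long integer, optionally bounded like long(0..)."""
    def check(value, path):
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{path}: expected a long")
        if (lo is not None and value < lo) or (hi is not None and value > hi):
            raise ValueError(f"{path}: {value} is out of range")
    return check


def string(value, path):
    """Checker for a string property."""
    if not isinstance(value, str):
        raise TypeError(f"{path}: expected a string")


def record(fields):
    """Checker for a structure; `fields` maps name -> (required, checker)."""
    def check(value, path):
        if not isinstance(value, dict):
            raise TypeError(f"{path}: expected a record")
        extra = set(value) - set(fields)
        if extra:
            raise ValueError(f"{path}: unexpected fields {sorted(extra)}")
        for name, (required, checker) in fields.items():
            if name not in value:
                if required:
                    raise ValueError(f"{path}/{name}: missing field")
                continue
            checker(value[name], f"{path}/{name}")
    return check


memory_t = record({"size": (True, long_range(lo=0))})
hardware_t = record({
    "vendor": (True, string),
    "model": (True, string),
    "serial": (False, string),     # optional field
    "memory": (True, memory_t),
})
root_t = record({"hardware": (True, hardware_t)})


def conforms(profile):
    """Return True if the profile matches root_t, False otherwise."""
    try:
        root_t(profile, "")
    except (TypeError, ValueError):
        return False
    return True


good = {"hardware": {"vendor": "Elonex", "model": "800x2/512",
                     "memory": {"size": 512}}}
print(conforms(good))  # True
```

Checking starts at the root and recurses, so a single declaration for "/" is enough to constrain the whole tree.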


Validation 


To have even greater control over the information
generated by the compiler, one can attach arbitrary
validation code either to a type or to a configuration
path; see Listing 3.
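For readers unfamiliar with Pan, the two flavours of validation (one attached to a type, one attached to a path) can be paraphrased in Python. The sketch assumes the convention that a validator returns true on success or an error message on failure; the function names `valid_ipv4` and `valid_memory` are hypothetical.

```python
import re


def valid_ipv4(text):
    """Type-level check: dotted notation with four chunks of at most 255."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return "bad string"
    for i in range(1, 5):
        if int(match.group(i)) > 255:
            return f"chunk {i} too big: {match.group(i)}"
    return True


def valid_memory(profile):
    """Path-level check: at least 256 MB of RAM per processor."""
    hardware = profile["hardware"]
    return hardware["memory"]["size"] >= 256 * len(hardware["cpus"])


print(valid_ipv4("192.168.1.10"))  # True
print(valid_ipv4("10.0.0.256"))    # chunk 4 too big: 256
```

The path-level check needs to look at another part of the tree (the list of CPUs), which is what makes validation more expressive than a type alone.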


Data Manipulation Language 


The validation code is represented in a simple 
yet powerful data manipulation language which is a 
subset of Pan and syntactically similar to C or Perl. 
Rather than embedding another language such as Perl 
or Python for this task, we decided to design our own. 
This was necessary to maintain control over type 
checking and to encourage users to use the declarative 
parts of Pan. Builtin functions such as pattern match- 
ing and substitution are available and user defined 
functions are supported. 


Although we prefer the declarative approach to 
the procedural approach, this data manipulation lan- 
guage is very convenient to perform complex opera- 
tions. Listing 4 illustrates the use of Pan to introduce 
an element into a given list position. 
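The list operation in question, inserting one string after another or appending it when the other is absent, reads as follows in Python. It is given only as a reference point for the behaviour, with a hypothetical function name mirroring the Pan one.

```python
def insert_after(new, anchor, items):
    """Insert `new` right after `anchor` in `items`, or at the end if absent.

    The list is modified in place and also returned.
    """
    if not (isinstance(new, str) and isinstance(anchor, str)
            and isinstance(items, list)):
        raise TypeError("usage: insert_after(string, string, list)")
    try:
        position = items.index(anchor) + 1   # found: just after the anchor
    except ValueError:
        position = len(items)                # not found: at the end
    items.insert(position, new)
    return items


services = ["dns", "dhcp", "mail", "postgres"]
insert_after("apache", "dns", services)
print(services)  # ['dns', 'apache', 'dhcp', 'mail', 'postgres']
```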


# structure representing the (physical) memory
define type memory_t = {
  "size" : long(0..)         # a long which is greater than or equal to 0
};

# structure representing the complete hardware
define type hardware_t = {
  "vendor"  : string
  "model"   : string
  "serial"  ? string         # this field is optional
  "memory"  : memory_t
  "cpus"    : cpu_t[1..8]    # a list of between 1 and 8 cpu_t
  "devices" : device_t{}     # a (maybe empty) table of device_t
};

# structure representing the root of the configuration tree
define type root_t = {
  "hardware" : hardware_t
  "system"   : system_t
  "software" : software_t
};

# the root of the configuration tree (i.e., /) must be of type root_t
type "/" = root_t;

Listing 2: Type enforcement.


# IPv4 address in dotted number notation
define type ipv4 = string with {
  result = matches(self, '^(\d+)\.(\d+)\.(\d+)\.(\d+)$');
  if (length(result) == 0)
    return("bad string");
  i = 1;
  while (i <= 4) {
    x = to_long(result[i]);
    if (x > 255)
      return("chunk " + to_string(i) + " too big: " + result[i]);
    i = i + 1;
  };
  return(true);
};

# make sure that we have at least 256MB of RAM per processor
valid "/hardware/memory/size" = self >= 256 * length(value("/hardware/cpus"));

Listing 3: Validation code.


Miscellaneous


The Pan compiler keeps track of derivation infor- 
mation which precisely links configuration informa- 
tion appearing in the LLD to the originating HLD 
statements. When HLD templates are modified, this 
derivation information is used to determine which 
LLDs must be recreated, thus minimizing the work 
carried out by the compiler. The information is also 
used to determine which HLD templates are responsi- 
ble for the final value of a given configuration param- 
eter. 
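The paper does not describe how the derivation information is stored, so the following Python sketch is purely hypothetical. It only illustrates the two uses mentioned above: finding which LLDs must be recreated after a template changes, and finding which template set the final value of a parameter. The class and method names are invented for the illustration.

```python
from collections import defaultdict


class Derivations:
    """Hypothetical record of which templates contributed to which LLD."""

    def __init__(self):
        self.users = defaultdict(set)  # template -> object templates using it
        self.origin = {}               # (object, path) -> last template to set it

    def record(self, obj, path, template):
        """Note that `template` assigned `path` while compiling `obj`."""
        self.users[template].add(obj)
        self.origin[(obj, path)] = template  # later assignments win

    def to_rebuild(self, modified):
        """Objects whose LLD must be regenerated after `modified` changed."""
        objects = set()
        for template in modified:
            objects |= self.users.get(template, set())
        return objects

    def responsible(self, obj, path):
        """Template that produced the final value found at `path`."""
        return self.origin.get((obj, path))


d = Derivations()
d.record("venus001", "/hardware/vendor", "pc_elonex_800x2_512")
d.record("venus002", "/hardware/vendor", "pc_elonex_800x2_512")
d.record("venus002", "/hardware/devices/hda", "pc_elonex_800x2_512")
d.record("venus002", "/hardware/devices/hda", "venus002")  # overridden later

print(sorted(d.to_rebuild({"pc_elonex_800x2_512"})))  # ['venus001', 'venus002']
print(d.responsible("venus002", "/hardware/devices/hda"))  # venus002
```

This version is deliberately conservative: a template is still counted as used by an object even when all of its values were later overridden, which can only cause extra recompilation, never a stale LLD.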


Comparison With Other Tools 


Pan (and its associated compiler) cannot be con- 
sidered as a system administration tool by itself: it is 
only a language to express configuration information. 
As far as we know, there exists no similar tool to com- 
pare directly with. What follows is a comparison with 
the way different tools or projects manipulate system 
configuration information. 


Arusha Project 


The Arusha Project (ARK, http://ark.sourceforge.net)
provides a framework for collaborative system
administration [13]. It provides a simple, XML-
based language that can be used to describe almost 
everything, from package management to documenta- 
tion or system configuration. Unfortunately, this lan- 
guage lacks strong type checking and validation. It 
also mixes code and data (e.g., some Perl or Python 
code can be embedded inside XML, close to the data) 
which is something that we do not want. 

Cfengine 

Cfengine (http://www.cfengine.org) is an autonomous
agent [3] with a high level declarative language
to manage large computer networks. It has no real types
and limited support for lists. Its configuration is not a
real abstraction of the machine configuration but rather
a set of instructions for its different modules such as
network interface configuration, symbolic link management,
checks for permissions and ownership of
files, etc. For instance, the modification of system files
like /etc/inetd.conf is often done with instructions such
as AppendIfNoSuchLine or CommentLinesMatching. It has
no support for validation.


DMTF 


The Distributed Management Task Force
(DMTF, http://www.dmtf.org) is an organization
developing “management standards.” The standards
closest to Pan are part of the Common Information
Model (CIM, http://www.dmtf.org/standards/standard_cim.php).
Their approach is complex and mixes
configuration management and system monitoring.
Although their standards could not be used directly
in our work, we have tried to stay close to them and to
reuse some parts of their data schemas.


LCFG 


LCFG [1] (http://www.lcfg.org) is a system for
automatically installing and managing the configuration
of large numbers of Unix systems. It does use
some abstraction to describe the machine configuration
but the language used does not really have types.
All the parameters are basically strings similar to X
resources and compound types (such as lists or tables)
are built on top of these with some ad-hoc name mangling.
Inheritance is achieved by using cpp and
include files. The combination of cpp macros and
embedded Perl code hinders the clarity of this otherwise
mainly declarative language. On the other hand,
it has some advanced features (like constraint based
list ordering) that will probably be added to Pan in the
future.


Although LCFG and EDG are separate projects, 
the development teams share ideas and some compati- 
bility exists. For instance, the Pan compiler can pro- 
duce some XML files that can be understood by the 
LCFG components. 


# insert a string after another one in a list of strings
# (or at the end if not found)
define function insert_after = {
  if (argc != 3 ||
      !is_string(argv[0]) || !is_string(argv[1]) || !is_list(argv[2]))
    error("usage: insert_after(string, string, list)");
  idx = index(argv[1], argv[2]);
  if (idx < 0) {
    # not found, we insert at the end
    splice(argv[2], length(argv[2]), 0, list(argv[0]));
  } else {
    # found, we insert just after
    splice(argv[2], idx+1, 0, list(argv[0]));
  };
  return(argv[2]);
};

# here is how to use it to insert "apache" after "dns"
"/boot/services" = list("dns", "dhcp", "mail", "postgres");
"/boot/services" = insert_after("apache", "dns", value("/boot/services"));

Listing 4: Introducing an element into a given list position.


Status and Availability


After a first Perl prototype last year, the new 
compiler (built using C++, STL, Lex and Yacc) is 
almost complete (at the time of this writing) and will 
be delivered to the EDG in September. 


The software will be available under the open
source EDG Software License⁶ from the EDG WP4
Configuration Task web site at
http://cern.ch/hep-proj-grid-fabric-config.

At the time of this writing, the Pan language has 
been successfully used to describe a large fraction of 
the configuration of the Linux machines used inside 
the EDG project. Work is in progress to extend this to 
other machines in our computer centre at CERN. 


Acknowledgements 


We would like to thank the European Union for 
their support of the EDG project and our colleagues 
from the WP4 for their contributions and very fruitful 
discussions on the topics of system administration and 
configuration. 


Author Biographies


Lionel Cons earned an “Ingénieur de l’École
Polytechnique” (Paris) diploma in 1988 and the
ENSIMAG engineer’s diploma two years later. He
then joined CERN where he worked as a C software
developer and then as a UNIX system engineer in the
Information Technology division. He presently works
on system security and is the leader of the WP4
configuration task of the EDG project.


Piotr Poznanski earned an MSc degree in Computer
Science from the University of Mining and Metallurgy
(Cracow, Poland). He joined CERN in 2000
and currently works in the EDG project as a software
engineer.


References 


[1] Anderson, Paul, “Towards a High-Level
Machine Configuration System,” LISA Conference
Proceedings, 1994.

[2] Anderson, Paul, A Declarative Approach to the
Specification of Large-Scale System Configurations,
http://www.dcs.ed.ac.uk/home/paul/publications/conflang.pdf,
2001.

[3] Burgess, Mark, “Computer Immunology,” LISA
Conference Proceedings, 1998.

[4] Cons, Lionel and Piotr Poznanski, Configuration
Database Global Design,
http://cern.ch/hep-proj-grid-fabric-config, 2002.

[5] Cons, Lionel and Piotr Poznanski, High Level
Configuration Description Language Specification,
http://cern.ch/hep-proj-grid-fabric-config, 2002.

[6] da Silva, Fabio Q. B., Juliana Silva da Cunha,
Danielle M. Franklin, Luciana S. Varejao, and
Rosalie Belian, “A Configuration Distribution
System for Heterogeneous Networks,” LISA
Conference Proceedings, 1998.

[7] da Silveira, Gledson Elias and Fabio Q. B. da
Silva, “A Configuration Distribution System for
Heterogeneous Networks,” LISA Conference
Proceedings, 1998.

[8] European Union DataGrid Project (EDG),
http://www.eu-datagrid.org.

[9] Evard, Rémy, “An Analysis of UNIX System
Configuration,” LISA Conference Proceedings,
1997.

[10] Foster, Ian and Carl Kesselman, The Grid:
Blueprint for a New Computing Infrastructure,
Morgan Kaufmann,
http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-475-8,
1998.

[11] George, Michael, Node Profile Specification,
http://cern.ch/hep-proj-grid-fabric-config, 2002.

[12] Harlander, Dr. Magnus, “Central System
Administration in a Heterogeneous Unix Environment:
GeNUAdmin,” LISA Conference Proceedings,
1994.

[13] Holgate, Matt and Will Partain, “The Arusha
Project: A Framework for Collaborative Unix
System Administration,” LISA Conference
Proceedings, 2001.

[14] Large Scale System Configuration Workshop,
http://www.dcs.ed.ac.uk/home/dcspaul/wshop, 2001.

[15] Poznanski, Piotr, Node View Access API
Specification, http://cern.ch/hep-proj-grid-fabric-config,
2002.

[16] Raymond, Eric S., The Cathedral and the
Bazaar, O’Reilly, http://www.oreilly.com/catalog/cb,
1999.

[17] Rouillard, John P. and Richard B. Martin,
“Config: A Mechanism for Installing and Tracking
System Configurations,” LISA Conference
Proceedings, 1994.

[18] Sventek, Joe, Configuration, Monitoring and
Management of Huge-scale Applications with a
Varying Number of Application Components,
http://www.dcs.ed.ac.uk/home/dcspaul/wshop/HugeScale.pdf,
2001.

[19] Traugott, Steve and Joel Huddleston,
“Bootstrapping an Infrastructure,” LISA Conference
Proceedings, 1998.

[20] van der Hoek, André, Dennis Heimbigner, and
Alexander L. Wolf, Software Architecture,
Configuration Management, and Configurable Distributed
Systems: A Ménage a Trois,
http://citeseer.nj.nec.com/hoek98software.html, 1998.

[21] WP4, EDG Fabric Management Work Package,
http://cern.ch/hep-proj-grid-fabric.

[22] WP4C, EDG WP4 Configuration Task,
http://cern.ch/hep-proj-grid-fabric-config.


⁶ http://www.eu-datagrid.org/license.html


Appendix A: Partial LLD Example 


This is an oversimplified example; more complete examples can be found on our web site [22]. 


<?xml version="1.0" encoding="utf-8"?>
<nlist name="profile" type="record">
  <nlist name="hardware" type="record">
    <string name="vendor">Elonex</string>
    <string name="model">850/256</string>
    <list name="cpus">
      <nlist type="record">
        <string name="vendor">Intel</string>
        <string name="model">Pentium III (Coppermine)</string>
        <double name="speed">853.22</double>
      </nlist>
    </list>
    <string name="serial">CHO1112041</string>
    <nlist name="memory" type="record">
      <long name="size">256</long>
    </nlist>
    <nlist name="devices" type="table">
      <nlist name="hda" type="record">
        <string name="vendor">QUANTUM</string>
        <string name="model">FIREBALLP AS20.5</string>
        <string name="type">disk</string>
        <long name="size">19596</long>
      </nlist>
      <nlist name="hdc" type="record">
        <string name="vendor">LG</string>
        <string name="model">CRD-8521B</string>
        <string name="type">cd</string>
      </nlist>
      <nlist name="eth0" type="record">
        <string name="vendor">3Com</string>
        <string name="model">3c905B-Combo [Deluxe Etherlink XL 10/100]</string>
        <string name="type">net</string>
        <string name="driver">3c59x</string>
        <string name="address">00:d0:b7:a9:a3:47</string>
      </nlist>
    </nlist>
  </nlist>
  <nlist name="system" type="record">
    <list name="mounts">
      <nlist type="record">
        <string name="type">swap</string>
        <string name="path">swap</string>
        <string name="device">hda1</string>
      </nlist>
      <nlist type="record">
        <string name="type">ext2</string>
        <string name="path">/</string>
        <string name="device">hda2</string>
      </nlist>
      <nlist type="record">
        <string name="type">ext2</string>
        <string name="path">/var</string>
        <string name="device">hda3</string>
      </nlist>
      <nlist type="record">
        <string name="type">proc</string>
        <string name="path">/proc</string>
      </nlist>
      <nlist type="record">
        <string name="type">devpts</string>
        <string name="path">/dev/pts</string>
        <list name="options">
          <string>gid=5</string>
          <string>mode=620</string>
        </list>
      </nlist>
      <nlist type="record">
        <string name="type">ext2</string>
        <string name="path">/mnt/floppy</string>
        <list name="options">
          <string>noauto</string>
          <string>owner</string>
        </list>
        <string name="device">fd0</string>
      </nlist>
      <nlist type="record">
        <string name="type">afs</string>
        <string name="path">/afs</string>
      </nlist>
      <nlist type="record">
        <string name="type">iso9660</string>
        <string name="path">/mnt/cdrom</string>
        <list name="options">
          <string>noauto</string>
          <string>owner</string>
          <string>ro</string>
        </list>
        <string name="device">hdc</string>
      </nlist>
    </list>
    <nlist name="partitions" type="table">
      <nlist name="hda1" type="record">
        <string name="type">primary</string>
        <string name="disk">hda</string>
        <long name="size">512</long>
        <long name="id">82</long>
      </nlist>
      <nlist name="hda2" type="record">
        <string name="type">primary</string>
        <string name="disk">hda</string>
        <long name="size">18828</long>
        <long name="id">83</long>
      </nlist>
      <nlist name="hda3" type="record">
        <string name="type">primary</string>
        <string name="disk">hda</string>
        <long name="size">256</long>
        <long name="id">83</long>
      </nlist>
    </nlist>
  </nlist>
</nlist>


Appendix B: Partial HLD Examples 


These HLD templates have been used to generate the LLD found in Appendix A. More sample code can be 
found on our web site [22]. 
functions.tpl

##############################################################################
# Useful functions.
##############################################################################

declaration template functions;

# insert_after(string, string, list): insert the first string after the second
# one (if found) or at the end (otherwise); the last argument is modified but
# also returned as the result of the function
define function insert_after = {
  if (argc != 3 ||
      !is_string(argv[0]) || !is_string(argv[1]) || !is_list(argv[2]))
    error("usage: insert_after(string, string, list)");
  idx = index(argv[1], argv[2]);
  if (idx < 0) {
    # not found, we insert at the end
    splice(argv[2], length(argv[2]), 0, list(argv[0]));
  } else {
    # found, we insert just after
    splice(argv[2], idx+1, 0, list(argv[0]));
  };
  return(argv[2]);
};

# given a disk name, return a table of three primary partitions for swap, root
# and /var with a very simple space allocation algorithm
define function simple_partitions = {
  if (argc != 1 || !is_string(argv[0]))
    error("usage: simple_partitions(string)");
  disk = argv[0];
  disk_size = value("/hardware/devices/" + disk + "/size");
  # swap is twice the size of the physical memory
  swap = nlist(
    "disk", disk,
    "type", "primary",
    "size", 2 * value("/hardware/memory/size"),
    "id", 82, # Linux swap
  );
  # var is 256MB for disks larger than 2GB, 128MB otherwise
  var = nlist(
    "disk", disk,
    "type", "primary",
    "size", if (disk_size > 2048) 256 else 128,
    "id", 83, # Linux
  );
  # root is the rest
  root = nlist(
    "disk", disk,
    "type", "primary",
    "size", disk_size - swap["size"] - var["size"],
    "id", 83, # Linux
  );
  # order of partitions is swap, root and var
  return(nlist(
    disk+"1", swap,
    disk+"2", root,
    disk+"3", var,
  ));
};
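As a cross-check of the space allocation rule, the same arithmetic can be written in Python and compared with the partition sizes in Appendix A. The sketch takes the disk size and memory size as arguments instead of reading them from the configuration tree; that signature is an assumption made to keep the example self-contained.

```python
def simple_partitions(disk, disk_size_mb, memory_mb):
    """Return swap, root and /var partitions for `disk` (sizes in MB)."""
    swap = {"disk": disk, "type": "primary",
            "size": 2 * memory_mb, "id": 82}          # twice the memory
    var = {"disk": disk, "type": "primary",
           "size": 256 if disk_size_mb > 2048 else 128, "id": 83}
    root = {"disk": disk, "type": "primary",
            "size": disk_size_mb - swap["size"] - var["size"], "id": 83}
    # order of partitions is swap, root and var
    return {disk + "1": swap, disk + "2": root, disk + "3": var}


# Values from the sample machine: a 19596 MB disk and 256 MB of memory.
parts = simple_partitions("hda", 19596, 256)
print(parts["hda1"]["size"], parts["hda2"]["size"], parts["hda3"]["size"])
# 512 18828 256
```

These are the sizes listed for hda1, hda2 and hda3 in the partitions table of Appendix A.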


types.tpl

##############################################################################
# Useful (but simplified) types.
##############################################################################

declaration template types;

##############################################################################
# simple types

# unsigned long
# (old style) define type ulong = long with self >= 0;
define type ulong = long(0..);

# unsigned double
define type udouble = double(0..);

# IPv4 address in dotted number notation
define type ipv4 = string with {
  result = matches(self, '^(\d+)\.(\d+)\.(\d+)\.(\d+)$');
  if (length(result) == 0)
    return("bad string");
  i = 1;
  while (i <= 4) {
    x = to_long(result[i]);
    if (x > 255)
      return("chunk " + to_string(i) + " too big: " + result[i]);
    i = i + 1;
  };
  return(true);
};

##############################################################################
# hardware types

# memory record
define type memory_t = {
  "size" : ulong
};

# CPU record
define type cpu_t = {
  "vendor" : string
  "model"  : string
  "speed"  : udouble
};

# device record (describing some hardware devices such as disks)
define type device_t = {
  "type"    : string with match(self, '^(disk|cd|net)$')
  "vendor"  : string
  "model"   : string
  "size"    ? ulong
  "driver"  ? string
  "address" ? string
};

# hardware record (describing some complete hardware information)
define type hardware_t = {
  "vendor"  : string
  "model"   : string
  "serial"  : string
  "memory"  : memory_t
  "cpus"    : cpu_t[1..]   # list of at least one CPU
  "devices" : device_t{}   # table of devices, indexed by names such as hda
};

##############################################################################
# system types

# mount record (describing what will end up in /etc/fstab)
define type mount_t = {
  "device"  ? string
  "path"    : string
  "type"    : string
  "name"    ? string
  "options" ? string[]
};

# partition record (describing how to partition the disks)
define type partition_t = {
  "disk" : string with value("/hardware/devices/"+self+"/type") == "disk"
  "type" : string
  "size" : ulong
  "id"   : ulong
};

# system record (describing some of the system configuration)
define type system_t = {
  "mounts"     : mount_t[1..]    # list of at least one mount
  "partitions" ? partition_t{}   # table of partitions, indexed by, e.g., hda1
};

##############################################################################
# root type

# type of the root of the configuration information
define type root_t = {
  "hardware" : hardware_t   # hardware subtree
  "system"   : system_t     # system subtree
};

# declare that root is indeed of the root type
type "/" = root_t;
hardware.tpl

##############################################################################
# Sample hardware data.
##############################################################################

##############################################################################
# cpus

structure template cpu_intel_p3_800;
  "vendor" = "Intel";
  "model" = "Pentium III (Coppermine)";
  "speed" = 796.550; # MHz

structure template cpu_intel_p3_850;
  "vendor" = "Intel";
  "model" = "Pentium III (Coppermine)";
  "speed" = 853.220; # MHz

##############################################################################
# disks

structure template disk_quantum_fireballp_as20_5;
  "type" = "disk";
  "vendor" = "QUANTUM";
  "model" = "FIREBALLP AS20.5";
  "size" = 19596; # MB

structure template disk_ibm_dtla_307030;
  "type" = "disk";
  "vendor" = "IBM";
  "model" = "DTLA-307030";
  "size" = 29314; # MB

##############################################################################
# cdroms

structure template cdrom_lg_crd_8521b;
  "type" = "cd";
  "vendor" = "LG";
  "model" = "CRD-8521B";

##############################################################################
# network cards

structure template network_3com_3c905b;
  "type" = "net";
  "vendor" = "3Com";
  "model" = "3c905B-Combo [Deluxe Etherlink XL 10/100]";
  "driver" = "3c59x";

structure template network_intel_82557;
  "type" = "net";
  "vendor" = "Intel";
  "model" = "82557 [Ethernet Pro 100]";
  "driver" = "eepro100";

##############################################################################
# computers

structure template pc_elonex_850_256;
  "vendor" = "Elonex";
  "model" = "850/256";
  "cpus" = list(create("cpu_intel_p3_850"));
  "memory/size" = 256; # MB
  "devices/hda" = create("disk_quantum_fireballp_as20_5");
  "devices/hdc" = create("cdrom_lg_crd_8521b");
  "devices/eth0" = create("network_3com_3c905b");

structure template pc_elonex_800x2_512;
  "vendor" = "Elonex";
  "model" = "800x2/512";
  "cpus" = list(create("cpu_intel_p3_800"), create("cpu_intel_p3_800"));
  "memory/size" = 512; # MB
  "devices/hda" = create("disk_ibm_dtla_307030");
  "devices/eth0" = create("network_intel_82557");
system.tpl

##############################################################################
# Sample system data.
##############################################################################

##############################################################################
# standard mounts

structure template mount_afs;
  "path" = "/afs";
  "type" = "afs";

structure template mount_proc;
  "path" = "/proc";
  "type" = "proc";

structure template mount_devpts;
  "path" = "/dev/pts";
  "type" = "devpts";
  "options" = list("gid=5", "mode=620");

structure template mount_floppy;
  "device" = "fd0";
  "path" = "/mnt/floppy";
  "type" = "ext2";
  "options" = list("noauto", "owner");

structure template mount_cdrom;
  "device" = undef;
  "path" = "/mnt/cdrom";
  "type" = "iso9660";
  "options" = list("noauto", "owner", "ro");

##############################################################################
# mounting templates

# add the standard Linux mount entries
template mounting_linux;
  "/system/mounts" = merge(value("/system/mounts"), list(
    create("mount_proc"),
    create("mount_devpts"),
    create("mount_floppy"),
  ));

# add the AFS mount entry
template mounting_afs;
  "/system/mounts" = merge(value("/system/mounts"), list(create("mount_afs")));


sample.tpl

##############################################################################
# Sample object template.
##############################################################################

object template sample;

# standard includes
include types;
include functions;

# hardware information
"/hardware" = create("pc_elonex_850_256");
"/hardware/serial" = "CHO1112041";
"/hardware/devices/eth0/address" = "00:d0:b7:a9:a3:47";

# system information
"/system/partitions" = simple_partitions("hda");
"/system/mounts/0" = nlist("type", "swap", "path", "swap", "device", "hda1");
"/system/mounts/1" = nlist("type", "ext2", "path", "/", "device", "hda2");
"/system/mounts/2" = nlist("type", "ext2", "path", "/var", "device", "hda3");

include mounting_linux;
include mounting_afs;

# we also add a mount entry for our CD drive...
"/system/mounts" = merge(value("/system/mounts"),
  list(create("mount_cdrom", "device", "hdc")));

# ...and make sure that hdc indeed contains a CD drive!
valid "/hardware/devices/hdc" = self["type"] == "cd";


Appendix C: NFS Validation Example 
xvalidation.tpl 
{HHHAHAHHAHA HAHAHAHAHAHA HAA AHH 
## Simplified example of cross object validation. 


## All the NFS clients check that the NFS servers that they use indeed export 
# the directories that they mount. This is done transparently by adding some 
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## validation code to the mount record type. Further checks such as wildcards 
## in export list or export/mount options mismatch are left as an exercise for 
## the reader ;-) 


## Here is how to compile the server and two clients (result on stdout): 


i# % pan --stdout --output=nfssrvl xvalidation.tpl (will succeed) 
# = =% pan --stdout --output=nfscltl xvalidation.tpl (will succeed) 
# = % pan --stdout --output=nfsclt2 xvalidation.tpl (will fail) 


{HAHAHAHAHAHA AAA AAA AAA HH 
{HHH HAHAHAHAHA AAA AAA AAP 
## types definitions 

template types; 


# export record (roughly what is in /etc/exports) 

define type export = { 
"path" : string description "path of the exported directory" 
“client” =: string description "name of client allowed to mount it" 
"options" ? string[] description "list of exporting options like ro" 

}; 

## mount record (roughly what is in /etc/fstab) 


define type mount = { 
"device" : string description "device as understood by the mount command" 
"path" : string description "path of the mount point" 
"type" : string description "type of the mounted filesystem" 
"name" ? string description "name or label of this mount entry" 


"options" ? string[] description "list of mounting options like ro" 
} with valid_mount (self) ; 


## validation of a mount record (only nfs type records are checked)
define function valid_mount = {
    ## the mount record is our only argument
    mount = argv[0];

    ## we only care about NFS mounts, other types are considered OK
    if (mount["type"] != "nfs")
        return(true);

    ## the device field will give us the NFS server and path
    result = matches(mount["device"], '^([\w\.\-]+):(.+)$');
    if (length(result) == 0)
        error("bad nfs device: " + mount["device"]);
    server = result[1];
    path = result[2];

    ## we now look at the server's exports list
    exports = value("//" + server + "/system/exports");
    i = 0;
    len = length(exports);
    while (i < len) {
        ## we check if this export record is good for us by checking the client
        ## field against object (i.e., the name of the current object template)
        ## and the path; we want exact match and ignore the export/mount options
        if (exports[i]["client"] == object && exports[i]["path"] == path)
            return(true);
        i = i + 1;
    };

    ## we haven't found any export record matching our needs, we complain:
    error("server " + server + " does not export " + path + " to " + object);
};
##############################################################################
## NFS server definition




object template nfssrv1;

## type settings
include types;
type "/system/exports" = export[];

## data for this host
"/system/exports" = list(
    nlist( ## we export /home to hostx
        "path", "/home",
        "client", "hostx",
    ),
    nlist( ## we export /home to nfsclt1, read-only
        "path", "/home",
        "client", "nfsclt1",
        "options", list("ro"),
    ),
);

##############################################################################
## NFS clients definitions

template client;

## type settings
include types;
type "/system/mounts" = mount[];

## data for this host
"/system/mounts" = list(
    nlist( ## we mount /dev/hda1 as the root filesystem
        "device", "/dev/hda1",
        "path", "/",
        "type", "ext2",
    ),
    nlist( ## we NFS mount /home from the server nfssrv1
        "device", "nfssrv1:/home",
        "path", "/home",
        "type", "nfs",
    ),
);

## first client: known by the server, compilation will succeed
object template nfsclt1;
include client;

## second client: unknown to the server, compilation will fail with:
## *** user error: server nfssrv1 does not export /home to nfsclt2
object template nfsclt2;

include client;


ABSTRACT


Hosts in a well-architected enterprise infrastructure are self-administered; they perform their 
own maintenance and upgrades. By definition, self-administered hosts execute self-modifying 
code. They do not behave according to simple state machine rules, but can incorporate complex 
feedback loops and evolutionary recursion. 


The implications of this behavior are of immediate concern to the reliability, security, and 
ownership costs of enterprise and mission-critical computing. In retrospect, it appears that the 
same concerns also apply to manually-administered machines, in which administrators use tools 
that execute in the context of the target disk to change the contents of the same disk. The self- 
modifying behavior of both manual and automatic administration techniques helps explain the 
difficulty and expense of maintaining high availability and security in conventionally-administered 
infrastructures. 


The practice of infrastructure architecture tool design exists to bring order to this self- 
referential chaos. Conventional systems administration can be greatly improved upon through 
discipline, culture, and adoption of practices better fitted to enterprise needs. Creating a low-cost 
maintenance strategy largely remains an art. What can we do to put this art into the hands of 
relatively junior administrators? We think that part of the answer includes adopting a well-proven 
strategy for maintenance tools, based in part upon the theoretical properties of computing. 


In this paper, we equate self-administered hosts to Turing machines in order to help build a 
theoretical foundation for understanding this behavior. We discuss some tools that provide mech- 
anisms for reliably managing self-administered hosts, using deterministic ordering techniques. 


Based on our findings, it appears that no tool, written in any language, can predictably 
administer an enterprise infrastructure without maintaining a deterministic, repeatable order of 
changes on each host. The runtime environment for any tool always executes in the context of the 
target operating system; changes can affect the behavior of the tool itself, creating circular 
dependencies. The behavior of these changes may be difficult to predict in advance, so testing is 
necessary to validate changed hosts. Once changes have been validated in testing they must be 
replicated in production in the same order in which they were tested, due to these same circular 
dependencies. 


The least-cost method of managing multiple hosts also appears to be deterministic ordering. 
All other known management methods seem to include either more testing or higher risk for each 
host managed. 


This paper is a living document; revisions and discussion can be found at Infrastructures.Org, 
a project of TerraLuna, LLC. 


Foreword 
by Steve Traugott 


In 1998, Joel Huddleston and I suggested that an entire enterprise
infrastructure could be managed as one large “enterprise virtual
machine” (EVM) [bootstrap]. That paper briefly described parts of a
management toolset, later named ISconf [isconf]. This toolset, based
on relatively simple makefiles and shell scripts, did not seem
extraordinary at the time. At one point in the paper, we said that we
would likely use cfengine [cfengine] the next time around; I had been
following Mark Burgess’ progress since 1994.


That 1998 paper spawned a web site and com- 
munity at Infrastructures.Org. This community in turn 
helped launch the Infrastructure Architecture (IA) 
career field. In the intervening years, we’ve seen the 
Infrastructures.Org community grow from a few 
dozen to a few hundred people, and the IA field blos- 
som from obscurity into a major marketing campaign 
by a leading systems vendor. 


Since 1998, Joel and I have both attempted to use other tools,
including cfengine version 1. I’ve also tried to write tools from
scratch again several times, with mixed success. We have repeatedly
hit indications that our 1998 toolset was more optimized than we had
originally thought. It appears that in some ways Joel and I, and the
rest of our group at the Bank, were lucky; our toolset protected us
from many of the pitfalls that are lying in wait for IAs.


One of these pitfalls appears to be the lack of deterministic
ordering; I never realized how important ordering was until I tried to
use other tools that don’t support it. When left without the ability
to concisely describe the order of changes to be made on a machine,
I’ve seen a marked decrease in my ability to predict the behavior of
those changes, and a large increase in my own time spent monitoring,
troubleshooting, and coding for exceptions. These experiences have
shown me that loss of order seems to result in lower production
reliability and higher labor cost.


The ordered behavior of ISconf was more by accident than design. I
needed a quick way to get a grip on 300 machines. I cobbled a
prototype together on my HP100LX palmtop one March ’94 morning, during
the 35-minute train ride into Manhattan. I used ‘make’ as the state
engine because it’s available on most UNIX machines. The deterministic
behavior ‘make’ uses when iterating over prerequisite lists is
something I didn’t think of as important at the time; I was more
concerned with observing known dependencies than creating repeatable
order.


Using that toolset and the EVM mindset, we were able to repeatedly
respond to the chaotic international banking mergers and acquisitions
of the mid-90’s. This response included building and rebuilding some
of the largest trading floors in the world, launching on schedule each
time, often with as little as a few months’ notice, each launch
cleaner than the last. We knew at the time that these projects were
difficult; after trying other tool combinations for more recent
projects I think I have a better appreciation for just how difficult
they were. The phrase “throwing a truck through the eye of a needle”
has crossed my mind more than once. I don’t think we even knew the
needle was there.


At the invitation of Mark Burgess, I joined his LISA 2001 [lisa]
cfengine workshop to discuss what we’d found so far, with possible
targets for the cfengine 2.0 feature set. The ordering requirement
seemed to need more work; I found ordering surprisingly difficult to
justify to an audience practiced in the use of convergent tools, where
ordering is often considered a constraint to be specifically avoided
[couch, eika-sandnes]. Later that week, Lance Brown and I were
discussing this over dinner, and he hit on the idea of comparing a
UNIX machine to a Turing machine. The result is this paper.


Based on the symptoms we have seen when comparing ISconf to other
tools, I suspect that ordering is a keystone principle in automated
systems administration. Lance and I, with a lot of help from others,
will attempt to offer a theoretical basis for this suspicion. We
encourage others to attempt to refute or support this work at will; I
think systems administration may be about to find its computer science
roots. We have also already accumulated a large FAQ for this paper,
which we’ll put on the website. Discussion on this paper as well as
related topics is encouraged on the infrastructures mailing list at
http://Infrastructures.Org .


Why Order Matters 


There seem to be (at least) several major reasons 
why the order of changes made to machines is impor- 
tant in the administration of an enterprise infrastructure: 


A “circular dependency” or control-loop problem 
exists when an administrative tool executes code that 
modifies the tool or the tool’s own foundations (the 
underlying host). Automated administration tool design- 
ers cannot assume that the users of their tool will always 
understand the complex behavior of these circular 
dependencies. In most cases we will never know what 
dependencies end users might create; see assertions 
§A.40 and §A.46 in the ‘Turing Equivalence’ section of 
this paper. 

A test infrastructure is needed to test the behavior 
of changes before rolling them to production. No tool 
or language can remove this need, because no testing is 
capable of validating a change in any conditions other 
than those tested. This test infrastructure is useless 
unless there is a way to ensure that production machines 
will be built and modified in the same way as the test 
machines; see ‘The Need for Testing’ section. 


It appears that a tool that produces deterministic order of changes is
cheaper to use than one that permits more flexible ordering. The
unpredictable behavior resulting from unordered changes to disk is
more costly to validate than the predictable behavior produced by
deterministic ordering; see §A.58. Because cost is a significant
driver in the decision-making process of most IT organizations, we
will discuss this point more in the next section.


Local staff must be able to use administrative 
tools after a cost-effective (i.e., cheap and quick) turn- 
over phase. While senior infrastructure architects may 
be well-versed in avoiding the pitfalls of unordered 
change, we cannot be on the permanent staff of every 
IT shop on the globe. In order to ensure continued 
health of machines after rollout of our tools, the tools 
themselves need to have some reasonable default 
behavior that is safe if the user lacks this theoretical 
knowledge; see §A.40 and §A.54. 


This business requirement must be addressed by tool developers. In our
own practice, we have been able to successfully turn over enterprise
infrastructures to permanent staff many times over the last several
years. Turnover training in our case is relatively simple, because our
toolsets have always implemented ordered change by default. Without
this default behavior, we would have also needed to attempt to teach
advanced techniques needed for dealing with unordered behavior, such
as inspection of code in vendor-supplied binary packages; see the
‘Right Packages, Wrong Order’ section.


A Prediction 


“Order Matters” when we care about both qual- 
ity and cost while maintaining an enterprise infrastruc- 
ture. If the ideas described in this paper are correct, 
then we can make the following prediction: 


The least-cost way to ensure that the behavior of 
any two hosts will remain completely identical is 
always to implement the same changes in the 
same order on both hosts. 


This sounds very simple, almost intuitive, and 
for many people it is. But to our knowledge, isconf 
[isconf] is the only generally-available tool which 
specifically supports administering hosts this way. 
There seems to be no prior art describing this princi- 
ple, and in our own experience we have yet to see it 
specified in any operational procedure. It is trivially 
easy to demonstrate in practice, but has at times been 
surprisingly hard to support in conversation, due to the 
complexity of theory required for a proof. 
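
The prediction can be illustrated with a toy model. The following Python sketch is an illustration only and is not taken from isconf: it treats a host's disk as a dictionary and each change as a function whose effect depends on what is already on disk. The change names (install_libfoo, install_app) and the file paths are hypothetical.

```python
"""Toy model: the same changes applied in a different order can diverge."""


def install_libfoo(disk):
    # Hypothetical package: installs version 2 of a shared library.
    disk["/usr/lib/libfoo.so"] = "v2"


def install_app(disk):
    # Hypothetical package whose install script inspects the disk: it ships
    # its own v1 copy of libfoo only if no copy is present, then records
    # which version it found at install time.
    if "/usr/lib/libfoo.so" not in disk:
        disk["/usr/lib/libfoo.so"] = "v1"
    disk["/usr/bin/app"] = "linked against libfoo " + disk["/usr/lib/libfoo.so"]


def build_host(changes):
    """Apply the changes, in the given order, to an empty baseline disk."""
    disk = {}
    for change in changes:
        change(disk)
    return disk


host_a = build_host([install_libfoo, install_app])
host_b = build_host([install_libfoo, install_app])
host_c = build_host([install_app, install_libfoo])

print(host_a == host_b)  # same changes, same order
print(host_a == host_c)  # same changes, different order
```

In this sketch host_a and host_b end up identical, while host_c carries an application that was set up against a different library version even though all three hosts received the same two changes.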


Note that this prediction does not apply only to those situations when
you want to maintain two or more identical hosts. It applies to any
computer-using organization that needs cost-effective, reliable
operation. This includes those that have many unique production hosts;
see ‘The Need for Testing.’ The ‘Congruence’ section discusses this
further, including single-host rebuilds after a security breach.


This prediction also applies to disaster recovery (DR) or business
continuity planning. Any credible DR procedure includes some method of
rebuilding lost hosts, often with new hardware, in a new location.
Restoring from backups is one way to do this, but making complete
backups of multiple hosts is redundant: the same operating system
components must be backed up for each host, when all we really need
are the user data and host build procedures (how many copies of
/bin/ls do we really need on tape?). It is usually more efficient to
have a means to quickly and correctly rebuild each host from scratch.
A tool that maintains an ordered record of changes made after install
is one way to do this.


This prediction is particularly important for those organizations
using what we call self-administered hosts. These are hosts that run
an automated configuration or administration tool in the context of
their own operating environment. Commercial tools in this category
include Tivoli, Opsware, and CenterRun [tivoli, opsware, centerrun].
Open-source tools include cfengine, lcfg, pikt, and our own isconf
[cfengine, lcfg, pikt, isconf]. We will discuss the fitness of some of
these tools later; not all appear fully suited to the task.


This prediction applies to those organizations which still use an
older practice called “cloning” to create and manage hosts. In
cloning, an administrator or tool copies a disk image from one machine
to another, then makes the changes needed to make the host unique (at
minimum, IP address and hostname). After these initial changes, the
administrator will often make further changes over the life of the
machine. These changes may be required for additional functionality or
security, but are too minor to justify re-cloning. Unless order is
observed, identical changes made to multiple hosts are not guaranteed
to behave in a predictable way (§A.47). The procedure needed for
properly maintaining cloned machines is not substantially different
from that described in the section on ‘Describing Disk State.’


This prediction, stated more formally in §A.58, 
seems to apply to UNIX, Windows, and any other 
general-purpose computer with a rewritable disk and 
modern operating system. More generally, it seems to 
apply to any von Neumann machine with rewritable 
nonvolatile storage. 


Management Methods 


All computer systems management methods can 
be classified into one of three categories: divergent, 
convergent, and congruent. 


Divergence 


Divergence (Figure 1) generally implies bad 
management. Experience shows us that virtually all 
enterprise infrastructures are still divergent today. 
Divergence is characterized by the configuration of 
live hosts drifting away from any desired or assumed 
baseline disk content. 


Figure 1: Divergence (disk state plotted against time).


One quick way to tell if a shop is divergent is to 
ask how changes are made on production hosts, how 
those same changes are incorporated into the baseline 
build for new or replacement hosts, and how they are 
made on hosts that were down at the time the change 
was first deployed. If you get different answers, then 
the shop is likely divergent. 


The symptoms of divergence include unpre- 
dictable host behavior, unscheduled downtime, unex- 
pected package and patch installation failure, unclosed 
security vulnerabilities, significant time spent “firefight- 
ing,” and high troubleshooting and maintenance costs. 


The causes of divergence are generally that class of operations that
create non-reproducible change. Divergence can be caused by ad hoc
manual changes, changes implemented by two independent automatic
agents on the same host, and other unordered changes. Scripts which
drive rdist, rsync, ssh, scp [rdist, rsync, ssh], or other change
agents as a push operation [bootstrap] are also a common source of
divergence.


Convergence 


Convergence (Figure 2) is the process most senior systems
administrators first begin when presented with a divergent
infrastructure. They tend to start by manually synchronizing some
critical files across the diverged machines, then they figure out a
way to do that automatically. Convergence is characterized by the
configuration of live hosts moving towards an ideal baseline. By
definition, all converging infrastructures are still diverged to some
degree. (If an infrastructure maintains full compliance with a fully
descriptive baseline, then it is congruent according to our
definition, not convergent; see the ‘Congruence’ section.)





Figure 2: Convergence (disk state plotted against time).





The baseline description in a converging infras- 
tructure is characteristically an incomplete description 
of machine state. You can quickly detect convergence 
in a shop by asking how many files are currently 
under management control. If an approximate answer 
is readily available and is on the order of a few hun- 
dred files or less, then the shop is likely converging 
legacy machines on a file-by-file basis. 


A convergence tool is an excellent means of 
bringing some semblance of order to a chaotic infras- 
tructure. Convergent tools typically work by sampling 
a small subset of the disk — via a checksum of one or 
more files, for example — and taking some action in 
response to what they find. The samples and actions 
are often defined in a declarative or descriptive lan- 
guage that is optimized for this use. This emulates and 
preempts the firefighting behavior of a reactive human 
systems administrator — “see a problem, fix it.” 
Automating this process provides great economies of 
scale and speed over doing the same thing manually. 
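
The sample-then-act cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how cfengine or any other named tool is implemented: the function names (sha256_of, converge) and the scratch file standing in for a managed configuration file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 of a file, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def converge(path, desired_content):
    """Sample one file; rewrite it only if it differs from the desired content.

    Returns True if a repair was made, False if the file already matched.
    """
    desired_sum = hashlib.sha256(desired_content).hexdigest()
    if sha256_of(path) == desired_sum:
        return False
    with open(path, "wb") as f:
        f.write(desired_content)
    return True


# Usage on a scratch file standing in for a managed configuration file.
workdir = tempfile.mkdtemp()
managed = os.path.join(workdir, "resolv.conf")
desired = b"nameserver 192.0.2.1\n"

first_run = converge(managed, desired)   # file is missing, so it is repaired
second_run = converge(managed, desired)  # file now matches, so nothing is done
print(first_run, second_run)
```

Note that the sketch only ever looks at the one file it was told about; everything else on disk remains unmanaged, which is the limitation discussed in the following paragraphs.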


Convergence is a feature of Mark Burgess’ Computer Immunology
principles [immunology]. His cfengine is in our opinion the best tool
for this job [cfengine]. Simple file replication tools [sup, cvsup,
rsync] provide a rudimentary convergence function, but without the
other action semantics and fine-grained control that cfengine
provides.


Because convergence typically includes an intentional process of
managing a specific subset of files, there will always be unmanaged
files on each host. Whether current differences between unmanaged
files will have an impact on future changes is undecidable, because at
any point in time we do not know the entire set of future changes, or
what files they will depend on.


It appears that a central problem with convergent 
administration of an initially divergent infrastructure is 
that there is no documentation or knowledge as to 
when convergence is complete. One must treat the 
whole infrastructure as if the convergence is incom- 
plete, whether it is or not. So without more informa- 
tion, an attempt to converge formerly divergent hosts 
to an ideal configuration is a never-ending process. By 
contrast, an infrastructure based upon first loading a 
known baseline configuration on all hosts, and limited 
to purely orthogonal and non-interacting sets of 
changes, implements congruence (defined in the next 
section). Unfortunately, this is not the way most shops 
use convergent tools such as cfengine. 


The symptoms of a convergent infrastructure 
include a need to test all changes on all production 
hosts, in order to detect failures caused by remaining 
unforeseen differences between hosts. These failures 
can impact production availability. The deployment 
process includes iterative adjustment of the configura- 
tion tools in response to newly discovered differences, 
which can cause unexpected delays when rolling out 
new packages or changes. There may be a higher inci- 
dence of failures when deploying changes to older 
hosts. There may be difficulty eliminating some of the 
last vestiges of the ad-hoc methods mentioned in the 
section on ‘Divergence.’ Continued use of ad-hoc and 
manual methods virtually ensures that convergence 
cannot complete. 


With all of these faults, convergence still provides 
much lower overall maintenance costs and better relia- 
bility than what is available in a divergent infrastructure. 
Convergence features also provide more adaptive self- 
healing ability than pure congruence, due to a conver- 
gence tool’s ability to detect when deviations from base- 
line have occurred. Congruent infrastructures rely on 
monitoring to detect deviations, and generally call for a 
rebuild when they have occurred. We discuss the secu- 
rity reasons for this in the ‘Congruence’ section. 


We have found apparent limits to how far con- 
vergence alone can go. We know of no previously 
divergent infrastructure that, through convergence 
alone, has reached congruence. This makes sense; 
convergence is a process of eliminating differences on 
an as-needed basis; the managed disk content will 
generally be a smaller set than the unmanaged content. 
In order to prove congruence, we would need to sam- 
ple all bits on each disk, ignore those that are user 
data, determine which of the remaining bits are rele- 
vant to the operation of the machine, and compare 
those with the baseline.


In our experience, it is not enough to prove via testing that two
hosts currently exhibit the same behavior while ignoring bit
differences on disk; we care not only about current behavior, but
future behavior as well. Bit differences that are currently deemed not
functional, or even those that truly have not been exercised in the
operation of the machine, may still affect the viability of future
change directives. If we cannot predict the viability of future change
actions, we cannot predict the future viability of the machine.
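
The comparison described above (sample every file, skip user data, compare the rest with the baseline) can be sketched as follows. This is a simplified Python illustration: it compares file contents only, not permissions, ownership, or other metadata, so it is weaker than a true bit-for-bit check. The helper names (manifest, differences, write) and the directory layout are hypothetical.

```python
import hashlib
import os
import tempfile


def manifest(root, user_data_dirs=()):
    """Map each file path (relative to root) to its SHA-256 digest.

    Files under any directory listed in user_data_dirs are skipped.
    """
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if any(rel.startswith(d + os.sep) for d in user_data_dirs):
                continue
            with open(full, "rb") as f:
                result[rel] = hashlib.sha256(f.read()).hexdigest()
    return result


def differences(baseline, host):
    """Return the sorted relative paths that are missing, extra, or changed."""
    paths = set(baseline) | set(host)
    return sorted(p for p in paths if baseline.get(p) != host.get(p))


def write(root, rel, data):
    """Create a file (and its parent directories) under root."""
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


# Two scratch trees standing in for a baseline image and a live host.
base = tempfile.mkdtemp()
host = tempfile.mkdtemp()
for root in (base, host):
    write(root, os.path.join("etc", "inetd.conf"), b"ftp stream tcp\n")
write(host, os.path.join("etc", "motd"), b"unmanaged file\n")
write(host, os.path.join("home", "alice", "notes"), b"user data\n")

drift = differences(manifest(base, ["home"]), manifest(host, ["home"]))
print(drift)
```

Even this simple check has to read every non-user file on the host, which suggests why proving congruence after the fact is so much more expensive than maintaining it from a known baseline.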


Deciding what bit differences are “functional” is often open to
individual interpretation. For instance, do we care about the order of
lines and comments in /etc/inetd.conf? We might strip out comments and
reorder lines without affecting the current operation of the machine;
this might seem like a non-functional change, until two years from
now. After time passes, the lack of comments will affect our future
ability to correctly understand the infrastructure when designing a
new change. This example would seem to indicate that even
non-machine-readable bit differences can be meaningful when attempting
to prove congruence.


Unless we can prove congruence, we cannot val- 
idate the fitness of a machine without thorough test- 
ing, due to the uncertainties described in §A.25. In 
order to be valid, this testing must be performed on 
each production host, due to the factors described in 
§A.47. This testing itself requires either removing the 
host from production use or exposing untested code to 
users. Without this validation, we cannot trust the 
machine in mission-critical operation. 


Congruence 


Congruence (Figure 3) is the practice of maintain- 
ing production hosts in complete compliance with a 
fully descriptive baseline (see the section on ‘Describ- 
ing Disk State’). Congruence is defined in terms of disk 
state rather than behavior, because disk state can be 
fully described, while behavior cannot (§A.59). 





Figure 3: Congruence (disk state plotted against time).


By definition, divergence from baseline disk 
state in a congruent environment is symptomatic of a 
failure of code, administrative procedures, or security. 
In any of these three cases, we may not be able to 
assume that we know exactly which disk content was 
damaged. It is usually safe to handle all three cases as 
a security breach: correct the root cause, then rebuild.


You can detect congruence in a shop by asking how the oldest, most
complex machine in the infrastructure would be rebuilt if destroyed.
If years of sysadmin work can be replayed in an hour, unattended,
without resorting to backups, and only user data need be restored from
tape, then host management is likely congruent.


Rebuilds in a congruent infrastructure are com- 
pletely unattended and generally faster than in any 
other; anywhere from ten minutes for a simple work- 
station to two hours for a node in a complex high- 
availability server cluster (most of that two hours is 
spent in blocking sleeps while meeting barrier condi- 
tions with other nodes). 


Symptoms of a congruent infrastructure include rapid, predictable,
“fire-and-forget” deployments and changes. Disaster recovery and
production sites can be easily maintained or rebuilt on demand in a
bit-for-bit identical state. Changes are not tested for the first time
in production, and there are no unforeseen differences between hosts.
Unscheduled production downtime is reduced to that caused by hardware
and application problems; firefighting activities drop considerably.
Old and new hosts are equally predictable and maintainable, and there
are fewer host classes to maintain. There are no ad-hoc or manual
changes. We have found that congruence makes cost of ownership much
lower, and reliability much higher, than any other method.


Our own experience and calculations show that converting from
divergence to congruence pays for itself in less than 8 months for
most organizations; see Figure 4. This graph assumes an existing
divergent infrastructure of 300 hosts, 2%/month growth rate, followed
by adoption of congruent automation techniques. Typical observed
values were used for other input parameters. Automation tool rollout
began at the 6-month mark in this graph, causing temporarily higher
costs; the investment is recovered 5 months later, where the manual
and automatic lines cross over at the 11-month mark. Following
crossover, we see rapidly increasing cost savings, continuing over the
life of the infrastructure. While this graph is calculated, the
results agree with actual enterprise environments that we have
converted. There is a CGI generator for this graph at
Infrastructures.Org, where you can experiment with your own
parameters.
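
The shape of such a calculation can be sketched in Python. This is not the authors' model or their CGI generator: only the 300 hosts, the 2%/month growth, and the 6-month rollout mark come from the text, while the per-host costs and the one-time rollout cost are hypothetical values in arbitrary units, chosen purely for illustration.

```python
def cumulative_costs(months, hosts, growth, manual_cost, auto_cost,
                     rollout_month, rollout_cost):
    """Return (manual, automatic) lists of cumulative cost, one entry per month.

    Months are numbered from 1. Before rollout_month both strategies pay the
    manual per-host cost; at rollout_month the automatic strategy pays a
    one-time rollout_cost and switches to the lower per-host cost.
    """
    manual, automatic = [], []
    manual_total = automatic_total = 0.0
    for month in range(1, months + 1):
        host_count = hosts * (1 + growth) ** (month - 1)
        manual_total += host_count * manual_cost
        if month < rollout_month:
            automatic_total += host_count * manual_cost
        else:
            if month == rollout_month:
                automatic_total += rollout_cost
            automatic_total += host_count * auto_cost
        manual.append(manual_total)
        automatic.append(automatic_total)
    return manual, automatic


def crossover_month(manual, automatic, rollout_month):
    """First month at or after rollout where automation is cumulatively cheaper."""
    for month in range(rollout_month, len(manual) + 1):
        if automatic[month - 1] < manual[month - 1]:
            return month
    return None


manual, automatic = cumulative_costs(
    months=24, hosts=300, growth=0.02,
    manual_cost=100.0, auto_cost=20.0,       # hypothetical units per host per month
    rollout_month=6, rollout_cost=150000.0,  # hypothetical one-time cost
)
crossover = crossover_month(manual, automatic, rollout_month=6)
print(crossover)
```

With these illustrative inputs the automatic line starts above the manual line at rollout and drops below it a few months later; different cost assumptions move the crossover point earlier or later.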


Congruence allows us to validate a change on 
one host in a class, in an expendable test environment, 
then deploy that change to production without risk of 
failure. Note that this is useful even when (or espe- 
cially when) there may be only one production host in 
that class. 


A congruence tool typically works by maintaining a journal of all
changes to be made to each machine, including the initial image
installation. The journal entries for a class of machine drive all
changes on all machines in that class. The tool keeps a lifetime
record, on the machine’s local disk, of all changes that have been
made on a given machine. In the case of loss of a machine, all changes
made can be recreated on a new machine by “replaying” the same
journal; likewise for creating multiple, identical hosts. The journal
is usually specified in a declarative language that is optimized for
expressing ordered sets and subsets. This allows subclassing and easy
reuse of code to create new host types; see ‘Describing Disk State.’
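
The journal-and-replay idea can be sketched in Python. This is a toy model and not isconf's actual design: the Journal class, its methods, the entry names, and the use of a dictionary in place of disk content are all hypothetical.

```python
class Journal:
    """Ordered list of named changes for one class of host (hypothetical API)."""

    def __init__(self):
        self.entries = []  # (name, function) pairs, in the order they were added

    def add(self, name, change):
        self.entries.append((name, change))

    def replay(self, disk, applied):
        """Apply, in journal order, every entry not yet recorded in `applied`.

        `disk` is a dict standing in for disk content. `applied` is the host's
        local record of change names already made; it is updated in place.
        """
        for name, change in self.entries:
            if name in applied:
                continue
            change(disk)
            applied.append(name)


journal = Journal()
journal.add("0001-base-image",
            lambda disk: disk.update({"/etc/release": "base"}))
journal.add("0002-add-ntp",
            lambda disk: disk.update({"/etc/ntp.conf": "server ntp.example.org"}))

# An existing host applies the journal as it stands today.
old_disk, old_applied = {}, []
journal.replay(old_disk, old_applied)

# A later change is appended; the existing host applies only the new entry.
journal.add("0003-tune-ntp",
            lambda disk: disk.update(
                {"/etc/ntp.conf": disk["/etc/ntp.conf"] + " iburst"}))
journal.replay(old_disk, old_applied)

# A replacement host replays the whole journal from an empty disk.
new_disk, new_applied = {}, []
journal.replay(new_disk, new_applied)

print(old_disk == new_disk, old_applied == new_applied)
```

Because both hosts apply the same entries in the same order, the rebuilt host ends up with the same content and the same local record as the long-lived one, which is the property the journal is meant to provide.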


There are few tools that are capable of the 
ordered lifetime journaling required for congruent 
behavior. Our own isconf (described in its own sec- 
tion) is the only specifically congruent tool we know 
of in production use, though cfengine, with some care 
and extra coding, appears to be usable for administra- 
tion of congruent environments. We discuss this in 
more detail in the ‘Cfengine Techniques’ section. 


We recognize that congruence may be the only acceptable technique for
managing life-critical systems infrastructures, including those that:

• Influence the results of human-subject health and medicine
  experiments
• Provide command, control, communications, and intelligence (C3I) for
  battlefield and weapons systems environments
• Support command and telemetry systems for manned aerospace vehicles,
  including spacecraft and national airspace air traffic control


Our personal experience shows that awareness of 
the risks of conventional host management techniques 
has not yet penetrated many of these organizations. 
This is cause for concern. 


Ordered Thinking 


We have found that designers of automated sys- 
tems administration tools can benefit from a certain 
mindset: 


Think like a kernel developer, not an application 
programmer. 


A good multitasking operating system is designed 
to isolate applications (and their bugs) from each other 
and from the kernel, and produce the illusion of inde- 
pendent execution. Systems administration is all about 
making sure that users continue to see that illusion. 


Modern languages, compilers, and operating sys- 
tems are designed to isolate applications programmers 
from “the bare hardware” and the low-level machine 
code, and enable object-oriented, declarative, and other 
high-level abstractions. But it is important to remember 
that the central processing unit(s) on a general-purpose 
computer only accepts machine-code instructions, and 
these instructions are coded in a procedural language. 
High-level languages are convenient abstractions, but 
are dependent on several layers of code to deliver 
machine language instructions to the CPU. 


In reality, on any computer there is only one pro- 
gram; it starts running when the machine finishes 
power-on self test (POST), and stops when you kill 
the power. This program is machine language code, 
dynamically linked at runtime, calling in fragments of 
code from all over the disk. These “fragments” of 
code are what we conventionally think of as applica- 
tions, shared libraries, device drivers, scripts, com- 
mands, administrative tools, and the kernel itself — all 
of the components that make up the machine’s operat- 
ing environment. 


None of these fragments can run standalone on 
the bare hardware — they all depend on others. We 
cannot analyze the behavior of any application-layer 
tool as if it were a standalone program. Even kernel 
startup depends on the bootloader, and in some operat- 
ing systems the kernel runtime characteristics can be 
influenced by one or more configuration files found 
elsewhere on disk. 


Figure 4: Cumulative costs for fully automated (congruent) versus manual administration. (Plot title: Cumulative Cost of Ownership, Manual versus Automatic Host Administration.)


This perspective is opposite from that of an appli-
cation programmer. An application programmer “sees”
the system as an axiomatic underlying support infras- 
tructure, with the application in control, and the kernel 
and shared libraries providing resources. A kernel 
developer, though, is on the other side of the syscall 
interface; from this perspective, an application is some- 
thing you load, schedule, confine, and kill if necessary. 


On a UNIX machine, systems administration 
tools are generally ordinary applications that run as 
root. This means that they, too, are at the mercy of the 
kernel. The kernel controls them, not the other way 
around. And yet, we depend on automated systems 
administration tools to control, modify, and occasion- 
ally replace not only that kernel, but any and all other 
disk content. This presents us with the potential for a 
circular dependency chain. 


A common misconception is that “there is some
high-level tool language that will avoid the need to 
maintain strict ordering of changes on a UNIX 
machine.” This belief requires that the underlying run- 
time layers obey axiomatic and immutable behavioral 
laws. When using automated administration tools we 
cannot consider the underlying layers to be axiomatic; 
the administration tool itself perturbs those underlying 
layers; see the ‘Circular Dependencies’ section. 


Inspection of high-level code alone is not 
enough. Without considering the entire system and its 
resulting machine language code, we cannot prove 
correctness. For example: 


print "hello\n"; 


This looks like a trivial-enough Perl program; it
“obviously” should work. But what if the Perl inter-
preter is broken? In other words, a conclusion of
“simple enough to easily prove” can only be made by
analyzing low-level machine language code, and the 
means by which it is produced. 


“Order Matters” because we need to ensure that 
the machine-language instructions resulting from a set 
of change actions will execute in the correct order, 
with the correct operands. Unless we can prove pro- 
gram correctness at this low level, we cannot prove 
the correctness of any program. It does no good to 
prove correctness of a higher-level program when we 
do not know the correctness of the lower runtime lay- 
ers. If the high-level program can modify those under- 
lying layers, then the behavior of the program can 
change with each modification. Ordering of those 
modifications appears to be important to our ability to 
predict the behavior of the high-level program. (Put
simply, it is important to ensure that you can step off
of the tree limb before you cut through it.)


The Need for Testing 


Just as we urge tool designers to think like kernel 
developers, we urge systems administrators to think
like operating systems vendors, because they are. Sys-
tems administration is actually systems modification;
the administrator replaces binaries and alters configura- 
tion files, creating a combination which the operating 
system vendor has never tested. Since many of these 
modifications are specific to a single site or even a sin- 
gle machine, it is unreasonable to assume that the ven- 
dor has done the requisite testing. The systems admin- 
istrator must perform the role of systems vendor, testing 
each unique combination — before the users do. 


Due to modern society’s reliance on computers, 
it is unethical (and just plain bad business practice) for 
an operating system vendor to release untested operat- 
ing systems without at least noting them as such. Bet- 
ter system vendors undertake a rigorous and exhaus- 
tive series of unit, system, regression, application, 
stress, and performance testing on each build before 
release, knowing full well that no amount of testing is 
ever enough (§A.9). They do this in their own labs; it 
would make little sense to plan to do this testing on 
customers’ production machines. 


And yet, IT shops today habitually have no dedi- 
cated testing environment for validating changed operat- 
ing systems. They deploy changes directly to production 
without prior testing. Our own experience and informal 
surveys show that greater than 95% of shops still do 
business this way. It is no wonder that reliability, secu- 
rity, and high availability are still major issues in IT. 


We urge systems administrators to create and use 
dedicated testing environments (§A.42), not inflict 
changes on users without prior testing, and consider 
themselves the operating systems vendors that they 
really are. We urge IT management organizations to 
understand and support administrators in these efforts; 
the return on investment is in the form of lower labor 
costs and much higher user satisfaction. Availability of 
a test environment enables the deployment of auto- 
mated systems administration tools, bringing major 
cost savings; see Figure 4. 


A test environment is useless until we have a 
means to replicate the changes we made in testing 
onto production machines. “Order matters” when we
do this replication; an earlier change will often affect 
the outcome of a later change. This means that 
changes made to a test machine must later be 
“replayed” in the same order on the machine’s pro- 
duction counterpart; see §A.45. 


Testing costs can be greatly reduced by limiting 
the number of unique builds produced; this holds true 
for both vendors and administrators. This calls for 
careful management of changes and host classes in an 
IT environment, with an intent of limiting prolifera- 
tion of classes; see §A.41. 


Note that use of open-source operating systems 
does not remove the need for local testing of local 
modifications. In any reasonably complex infrastruc-
ture, there will always be local configuration and non-
packaged binary modifications which the community
cannot have previously exercised. We prefer open 
source; we do not expect it to relieve us from our 
responsibilities though. 


Ordering HOWTO 


Automated systems administration is very 
straightforward. There is only one way for a user-side 
administrative tool to change the contents of disk in a 
running UNIX machine — the syscall interface. The 
task of automated administration is simply to make 
sure that each machine’s kernel gets the right system 
calls, in the right order, to make it be the machine you 
want it to be. 


Describing Disk State 


If there are N bits on a disk, then there are 2^N
possible disk states. In order to maintain the baseline 
host description needed for congruent management, 
we need to have a way to describe any arbitrary disk 
state in a highly compressed way, preferably in a 
human-readable configuration file or script. For the 
purposes of this description, we neglect user data and 
log files; we want to be able to describe the root-
owned and administered portions of disk. “Order Mat-
ters” whether creating or modifying a disk:


A concise and reliable way to describe any arbi- 
trary state of a disk is to describe the procedure 
for creating that state. 


This procedure will include the initial state (bare- 
metal build) of the disk, followed by the steps used to 
change it over time, culminating in the desired state. 
This procedure must be in writing, preferably in 
machine-readable form. This entire set of information, 
for all hosts, constitutes the baseline description of a 
congruent infrastructure. Each change added to the 
procedure updates the baseline. See the ‘Congruence’ 
section. 


There are tools which can help you maintain and 
execute this procedure. See the ‘Example Tools and 
Techniques’ section, particularly ‘Baseline Manage- 
ment in ISconf.’ 


While it is conceivable that this procedure could 
be a documented manual process, executing these 
steps manually is tedious and costly at best. (Though 
we know of many large mission-critical shops which 
try.) It is generally error-prone. Manual execution of 
complex procedures is one of the best methods we 
know of for generating divergence. 


The starting state (bare-metal install) description 
of the disk may take the form of a network install 
tool’s configuration file, such as that used for Solaris 
Jumpstart or RedHat Kickstart. The starting state 
might instead be a bitstream representing the entire 
initial content of the disk (usually a snapshot taken 
right after install from vendor CD). The choice of 
which of these methods to use is usually dependent on
the vendor-supplied install tool; some will support
either method, some require one or the other. 


How to Break an Enterprise 


A systems administrator, whether a human or a 
piece of software (§A.36), can easily break an enter- 
prise infrastructure by executing the right actions in 
the wrong order. In this section, we will explore some 
of the ways this can happen. 


Right Commands, Wrong Order 


First we will cover a trivial but devastating 
example that is easily avoided. This once happened to 
a colleague while doing manual operations on a 
machine. He wanted to clean out the contents of a 
directory which ordinarily had the development 
group’s source code NFS mounted over top of it. Here 
is what he wanted to do: 


umount /apps/src
cd /apps/src
rm -rf .
mount /apps/src


Here’s what he actually did: 


umount /apps/src
umount fails, directory in use;
while resolving this, his pager goes
off, he handles the interrupt, then...
cd /apps/src
rm -rf .


Needless to say, there had also been no backup 
of the development source tree for quite some time... 


In this example, “correct order” includes some
concept of sufficient error handling. We show this
example because it highlights the importance of a
default behavior of “halt on error” for automatic sys-
tems administration tools. Not all tools halt on error
by default; isconf does. 
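
As an illustrative sketch of the “halt on error” default (not isconf
code), the intended cleanup can be written in Bourne shell so that each
command runs only if the previous one succeeded. The function name
clean_mountpoint and its optional command arguments are hypothetical;
they exist so that the demonstration can substitute the harmless
stand-ins true and false for umount and mount and work on a scratch
directory.

```shell
# clean_mountpoint DIR [UMOUNT_CMD] [MOUNT_CMD]
# Hypothetical helper: empties DIR only if it was unmounted first.
clean_mountpoint() {
    dir=$1
    umount_cmd=${2:-umount}
    mount_cmd=${3:-mount}
    "$umount_cmd" "$dir" || return 1        # halt: still mounted or busy
    # ./* instead of . because POSIX rm refuses the operand "."
    ( cd "$dir" && rm -rf ./* ) || return 1
    "$mount_cmd" "$dir"
}

# Demonstration on a scratch directory; nothing is really mounted.
demo=$(mktemp -d)
touch "$demo/source.c"

# Case 1: the unmount fails ("directory in use"), so nothing is deleted.
if clean_mountpoint "$demo" false true; then
    failed_status=0
else
    failed_status=$?
fi
after_failure=$(ls "$demo")

# Case 2: the unmount succeeds, so the directory is emptied and remounted.
if clean_mountpoint "$demo" true true; then
    ok_status=0
else
    ok_status=$?
fi
after_success=$(ls "$demo")
rm -rf "$demo"
```

Explicit "|| return 1" is used here rather than "set -e" because the
shell ignores "set -e" inside a function that is called from an "if"
condition or an "&&"/"||" list, which is exactly where a caller is
likely to test the result.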

Right Packages, Wrong Order 

We in the UNIX community have long accused 
Windows developers of poor library management, due 
to the fact that various Windows applications often 
come bundled with differing versions of the same 
DLLs. It turns out that at least some UNIX and Linux 
distributions appear to suffer from the same problem. 


Jeffrey D’Amelia and John Hart [hart] demon- 
strated this in the case of RedHat RPMs, both official 
and contributed. They showed that the order in which 
you install RPMs can matter, even when there are no 
applicable dependencies specified in the package. We 
don’t assume that this situation is restricted to RPMs 
only — any package management system should be 
susceptible to this problem. An interesting study 
would be to investigate similar overlaps in vendor- 
supplied packages for commercial UNIX distributions. 

Detecting this problem for any set of packages 
involves extensive analysis by talented persons. In the 
case of [hart], the authors developed a suite of global
analysis tools, and repeatedly downloaded and
unpacked thousands of RPMs. They still only saw 
“the tip of the iceberg” (their words). They intention- 
ally ignored the actions of postinstall scripts, and they 
had not yet executed any packaged code to look for 
behavioral interactions. 


Avoiding the problem is easier; install the pack- 
ages, record the order of installation, test as usual, and 
when satisfied with testing, install the same packages 
in the same order on production machines. 
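
The record-and-replay approach above can be sketched in shell as
follows. This is a minimal illustration under stated assumptions, not
the mechanism of any particular tool: install_and_record, replay, and
the order.log file are hypothetical names, and a stand-in function
takes the place of the real package installer so that the example has
no side effects.

```shell
# install_and_record LOG INSTALLER PACKAGE
# Install on the test host; append to the log only if the install worked.
install_and_record() {
    log=$1
    installer=$2
    pkg=$3
    "$installer" "$pkg" || return 1
    printf '%s\n' "$pkg" >> "$log"
}

# replay LOG INSTALLER
# Install the logged packages in their recorded order; halt on error.
replay() {
    log=$1
    installer=$2
    while IFS= read -r pkg; do
        # </dev/null keeps the installer from consuming the rest of the log
        "$installer" "$pkg" </dev/null || return 1
    done < "$log"
}

# Demonstration with a stand-in installer that only notes what it was
# asked to install on the current (pretend) host.
workdir=$(mktemp -d)
fake_install() { printf '%s\n' "$1" >> "$workdir/installed_on_$host"; }

host=test
install_and_record "$workdir/order.log" fake_install foo=0.17-9
install_and_record "$workdir/order.log" fake_install bar=1.0.2-1

host=prod
replay "$workdir/order.log" fake_install

test_order=$(cat "$workdir/installed_on_test")
prod_order=$(cat "$workdir/installed_on_prod")
rm -rf "$workdir"
```

Because the log is only ever appended to, and only after a successful
install, the production host sees exactly the sequence that was tested.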


While we’ve used packages in this example, we'd 
like to remind the reader that these considerations apply 
not only to package installation, but to any other change 
that affects the root-owned portions of disk. 


Circular Dependencies 


There is a “chicken and egg” or bootstrapping 
problem when updating either an automated systems 
administration tool (ASAT) or its underlying founda- 
tions (§A.40). Order is important when changes the 
tool makes can change the ability of the tool to make 
changes. 


For example, cfengine version 2 includes new 
directives available for use in configuration files. 
Before using a new configuration file, the new version 
of cfengine needs to be installed. The new client is 
named ‘cfagent’ rather than ‘cfengine,’ so wrapper 
scripts and crontab entries should also be updated, and 
so on. 


For fully automated operation on hundreds or 
thousands of machines, we would like to be able to 
upgrade cfengine under the control of cfengine 
(§A.46). We want to ensure that the following actions
will take place on all machines, including those cur- 
rently down: 

1. fetch configuration file containing the follow- 
ing instructions 

2. install new cfagent binary 

3. run cfkey to generate key pair 

4. fetch new configuration file containing version 
2 directives 

5. update calling scripts and crontab entries 


There are several ordering considerations here. 
We won’t know that we need the new cfagent binary 
until we do step 1. We shouldn’t proceed with step 4 
until we know that 2 and 3 were successful. If we do 5 
too early, we may break the ability for cfengine to 
operate at all. If we do step 4 too early and try to run 
the resulting configuration file using the old version of 
cfengine, it will fail. 
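
A minimal sketch of the ordering constraints just described: the five
steps run strictly in sequence and the run halts at the first failure,
so a failed key generation never leads to the calling scripts being
changed. The step functions below are hypothetical placeholders that
only record that they ran; the real fetch, install, and cfkey
invocations are deliberately not shown.

```shell
# run_in_order STEP...
# Run each named step in the order given; stop at the first failure.
run_in_order() {
    for step in "$@"; do
        "$step" || { echo "halted at step: $step" >&2; return 1; }
    done
}

trace=""
note() { trace="$trace $1"; }

# Placeholder steps, numbered as in the list above.
fetch_bootstrap_config() { note 1; }
install_new_cfagent()    { note 2; }
generate_key_pair()      { note 3; return 1; }   # simulate cfkey failing
fetch_v2_config()        { note 4; }
update_calling_scripts() { note 5; }

if run_in_order fetch_bootstrap_config install_new_cfagent \
        generate_key_pair fetch_v2_config update_calling_scripts 2>/dev/null
then
    upgrade_status=0
else
    upgrade_status=$?
fi
```

In this simulated run steps 4 and 5 are never attempted, which leaves
the version 1 wrapper scripts and crontab entries untouched and still
usable for repair.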


While this example may seem straightforward, 
implementing it in a language which does not by 
default support deterministic ordering requires much 
use of conditionals, state chaining, or equivalent. If this 
is the case, then code flow will not be readily apparent, 
making inspection and edits error-prone. Infrastructure 
automation code runs as root and has the ability to stop 
work across the entire enterprise; it needs to be simple,
short, and easy for humans to read, like security-related
code paths in tools such as PGP or ssh. 


If the tool’s language does not support “halt on
error” by default, then it is easy to inadvertently allow
later actions to take place when we would have pre-
ferred to abort. Going back to our cfengine example, if
we can easily abort and leave the cfengine version 1
infrastructure in place, then we can still use version 1
to repair the damage.


Other Sources of Breakage 


There are many other examples we could show, 
some including multi-host ‘barrier’ problems. These 
include: 

• Updating ssh to openssh on hundreds of hosts
and getting the authorized_keys and/or protocol 
version configuration out of order. This can 
greatly hinder further contact with the target 
hosts. Daniel Hagerty [hagerty] ran into this 
one; many of us have been bitten by this at 
some point. 

• Reconfiguring network routes or interfaces
while communicating with the target device via 
those same routes or interfaces. Ordering errors 
can prevent further contact with the target, and 
often require a physical visit to resolve. This is 
especially true if the target is a workstation 
with no remote serial console access. Again, 
most readers have had this happen to them. 


Example Tools and Techniques 


While there are many automatic systems admin- 
istration tools (ASAT) available, the two we are most 
familiar with are cfengine and our own isconf
[cfengine, isconf]. In the next two sections, we will 
look at these two tools with a focus on how each can 
be used to create deterministic ordering. 


In general, some of the techniques that seem to 
work well for the design and use of most ASATs 
include: 

• Keep the “Turing tape” a finite size by holding
the network content constant (§A.23), or ver- 
sioning it using CVS or another version control 
tool [cvs, bootstrap]. This helps prevent some 
of the more insidious behaviors that are a 
potential in self-modifying machines (§A.40). 

• Continuing in that vein, when using distributed
package repositories such as the public Debian 
[debian] package server infrastructure, always 
specify version numbers when automating the 
installation of packages, rather than let the 
package installation tool (in Debian’s case apt- 
get) select the latest version. If you do not spec- 
ify the package version, then you may intro- 
duce divergence. This risk varies, of course, 
depending on your choice of ‘stable’ or ‘unsta- 
ble’ distribution, though we suspect it still 
applies in ‘stable,’ especially when using the 
‘security’ packages. It certainly applies in all 
cases when you need to maintain your own
kernel or kernel modules rather than using the
distributed packages. 

We have experienced this repeatedly — 
machines which built correctly the first time 
with a given package list will not rebuild with 
the same package list a few weeks later, due to 
package version changes on the public servers, 
and resulting unresolved incompatibilities with 
local conditions and configuration file contents. 
Remember, your hosts are unique in the world 
— there are likely no others like them. Package 
maintainers cannot be expected to test every 
configuration, especially yours. You must retain 
this responsibility. See ‘The Need for Testing.’ 


We use Debian in this example because it is a 
distribution we like a lot; note that other pack- 
age distribution and installation infrastructures, 
such as the RedHat up2date system, also have 
this problem. 

• Expect long dependency or sequence chains
when building enterprise infrastructures. If an 
ASAT can easily support encapsulation and 
ordering of 10, 50, or even 100 complex atomic 
actions in a single chain, then it is likely capa- 
ble of fully automated administration of 
machines, including package, kernel, build, and 
even rebuild management. If the ASAT is cum- 
bersome to use when chains become only two 
or three actions deep, then it is likely most 
suited for configuration file management, not 
package, binary, or kernel manipulation. 
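
The advice above on explicit version numbers can be checked
mechanically before a package list is handed to an installer. The
following sketch assumes the name=version syntax that apt-get accepts;
find_unpinned is a hypothetical helper, and the list contents are
placeholders.

```shell
# find_unpinned ENTRY...
# Print each entry lacking an explicit "=version"; return non-zero if any.
find_unpinned() {
    unpinned=0
    for entry in "$@"; do
        case $entry in
            ?*=?*) ;;                                    # pinned
            *)     printf '%s\n' "$entry"; unpinned=1 ;; # would float
        esac
    done
    return "$unpinned"
}

# 'baz' carries no version, so the check reports it and fails.
if loose=$(find_unpinned foo=0.17-9 bar=1.0.2-1 baz); then
    pin_status=0
else
    pin_status=$?
fi
```

Run ahead of the install step, a check like this turns an unpinned
entry into a halt rather than a silent source of divergence.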


ISconf Techniques 


As mentioned in the Foreword, isconf originally 
began life as a quick hack. Its basic utility has proven 
itself repeatedly over the last eight years, and as adoption 
has grown it is currently managing more production 
infrastructures than we are personally aware of. 


While we show some ISconf makefile examples 
here, we do not show any example of the top-level 
configuration file which drives the environment and 
targets for ‘make.’ It is this top-level configuration 
file, and the scripts which interpret it, which are the 
core of ISconf and enable the typing or classing of 
hosts. These top-level facilities also are what govern 
the actions ISconf is to take during boot versus cron or 
other execution contexts. More information and code 
is available at ISconf.org and Infrastructures.Org. 


We also do not show here the network fetch and 
update portions of ISconf, and the way that it updates 
its own code and configuration files at the beginning 
of each run. This default behavior is something that 
we feel is important in the design of any automated 
systems administration tool. If the tool does not sup- 
port it, end-users will have to figure out how to do it 
safely themselves, reducing the usability of the tool. 


ISconf Version 2 


Version 2 of ISconf was a late-90’s rewrite to 
clean up and make portable the lessons learned from
version 1. As in version 1, the code used was Bourne
shell, and the state engine used was ‘make.’ 


In Listing 1, we show a simplified example of 
Version 2 usage. While examples related to this can be 
found in [hart] and in our own makefiles, real-world 
usage is usually much more complex than the example 
shown here. We’ve contrived this one for clarity of 
explanation. 


In this contrived example, we install two pack- 
ages which we have not proven orthogonal. We in fact 
do not wish to take the time to detect whether or not 
they are orthogonal, due to the considerations 
expressed in §A.58. We may be tool users, rather than 
tool designers, and may not have the skillset to deter- 
mine orthogonality, as in §A.54. 


These packages might both affect the same 
shared library, for instance. Again according to [hart] 
and our own experience, it is not unusual for two 
packages such as these to list neither as prerequisites, 
so we might gain no ordering guidance from the pack- 
age headers either. 


In other words, all we know is that we installed 
package ‘foo,’ tested and deployed it to production, and 
then later installed package ‘bar,’ tested it and deployed. 
These installs may have been weeks or months apart. 
All went well throughout, users were happy, and we 
have no interest in unpacking and analyzing the contents 
of these packages for possible reordering for any reason; 
we’ve gone on to other problems. 


Because we know this order works, we wish for 
these two packages, ‘foo’ and ‘bar,’ to be installed in the 
same order on every future machine in this class. This 
makefile will ensure that; make always iterates over a 
prerequisite list in the same order. 


The touch $@ command at the end of each stanza 
will prevent this stanza from being run again. The 
ISconf code always changes to the timestamps directory 
before starting ‘make’ (and takes other measures to con- 
strain the normal behavior of ‘make,’ so that we never 
try to “rebuild” this target either). 
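
The run-once effect of the touch $@ idiom can be imitated in plain
shell, which may make the mechanism easier to see. This is a sketch of
the idea only, not ISconf's code; run_once and the scratch stamp
directory are hypothetical.

```shell
# run_once STAMPDIR NAME COMMAND [ARG...]
# Run COMMAND unless a stamp file for NAME already exists; leave the
# stamp only when COMMAND succeeds.
run_once() {
    stampdir=$1
    name=$2
    shift 2
    if [ -e "$stampdir/$name" ]; then
        return 0                    # already applied on this host
    fi
    "$@" || return 1                # failure: no stamp, so it can be retried
    touch "$stampdir/$name"
}

stamps=$(mktemp -d)
count=0
bump() { count=$((count + 1)); }

run_once "$stamps" foo bump         # runs bump and leaves a stamp
run_once "$stamps" foo bump         # stamp exists, so bump is skipped
runs_of_foo=$count

# A failing action leaves no stamp behind.
if run_once "$stamps" bar false; then bar_status=0; else bar_status=$?; fi
if [ -e "$stamps/bar" ]; then bar_stamped=yes; else bar_stamped=no; fi
rm -rf "$stamps"
```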


The class name in this case (Listing 1) is 
‘Block12.’ You can see that ‘Block12’ is also made up 
of many other packages; we don’t show the makefile 
stanzas for these here. These packages are listed as pre- 
requisites to ‘Block12,’ in chronological order. Note 
that we only want to add items to the end of this list, 
not the middle, due to the considerations expressed in 
section §A.49. 


In this example, even though we take advantage 
of the Debian package server infrastructure, we specify 
the version of package that we want, as in the introduc- 
tion to the ‘Example Tools and Techniques’ section. We 
also use a caching proxy when fetching Debian pack- 
ages, in order to speed up our own builds and reduce 
the load on the Debian servers to a minimum. 


Note that we get “halt-on-error” behavior from
‘make,’ as we wished for in ‘Right Commands, Wrong
Order.’ If any of the commands in the ‘foo’ or ‘bar’
sections exit with a non-zero return code, then ‘make’ 
aborts processing immediately. The ‘touch’ will not 
happen, and we normally configure the infrastructure 
such that the ISconf failure will be noticed by a moni- 
toring tool and escalated for resolution. In practice, 
these failures very rarely occur in production; we see 
and fix them in test. Production failures, by the defini- 
tion of congruence, usually indicate a systemic, secu- 
rity, or organizational problem; we don’t want them 
fixed without human investigation. 


Block12: cvs ntp foo lynx wget \
	serial_console bar sudo mirror_rootvg

foo:
	apt-get -y install foo=0.17-9
	touch $@

bar:
	apt-get -y install bar=1.0.2-1
	echo apple pear > /etc/bar.conf
	touch $@


Listing 1: ISconf makefile package ordering example. 


ISconf Version 3 


ISconf version 3 was a rewrite in Perl, by Luke 
Kanies. This version adds more “lessons learned,” 
including more fine-grained control of actions as 
applied to target classes and hosts. There are more lay- 
ers of abstraction between the administrator and the 
target machines; the tool uses various input files to 
generate intermediate and final file formats which 
eventually are fed to ‘make.’ 


One feature in particular is of special interest for 
this paper. In ISconf version 2, the administrator still 
had the potential to inadvertently create unordered 
change by an innocent makefile edit. While it is possi- 
ble to avoid this with foreknowledge of the problem, 
version 3 uses timestamps in an intermediate file to 
prevent it from being an issue. 


The problem which version 3 fixes can be repro-
duced in version 2 as follows; refer to Listing 1. If
both ‘foo’ and ‘bar’ have been executed (installed) on
production machines, and the administrator then adds
‘baz’ as a prerequisite to ‘bar,’ this would qualify
as “editing prior actions” and create the divergence
described in §A.49.


ISconf version 3, rather than using a human- 
edited makefile, reads other input files which the 
administrator maintains, and generates intermediate 
and final files which include timestamps to detect the 
problem and correct the ordering. 
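
One simple way to guard against this kind of edit is sketched below.
It is an assumption-laden illustration, not a description of how
ISconf version 3 works (which, as noted, relies on timestamps in an
intermediate file). The idea is to compare the order already applied
in production with the order a proposed makefile would produce, and to
accept the proposal only if it merely appends to the end. Both orders
are assumed to have been flattened into space-separated lists;
is_append_only is a hypothetical name.

```shell
# is_append_only APPLIED PROPOSED
# Succeeds only if PROPOSED begins with APPLIED, word for word.
is_append_only() {
    applied=$1
    proposed=$2
    [ -z "$applied" ] && return 0       # nothing applied yet: any order is fine
    case "$proposed " in
        "$applied "*) return 0 ;;       # applied history is an exact prefix
        *)            return 1 ;;       # history was edited or reordered
    esac
}

applied="foo bar"

# Appending 'baz' after 'bar' preserves the tested history.
if is_append_only "$applied" "foo bar baz"; then append_ok=yes; else append_ok=no; fi

# Making 'baz' a prerequisite of 'bar' yields foo, baz, bar: rejected.
if is_append_only "$applied" "foo baz bar"; then insert_ok=yes; else insert_ok=no; fi
```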


ISconf Version 4 

ISconf version 4, currently in prototype, repre- 
sents a significant architectural change from versions 
1 through 3. If the current feature plan is fully imple-
mented, version 4 will enable cross-organizational col-
laboration for development and use of ordered change
actions. A core requirement is decentralized develop-
ment, storage, and distribution of changes. It will
enable authentication and signing, encryption, and 
other security measures. We are likely to replace 
‘make’ with our own state engine, continuing the 
migration begun in version 3. See ISconf.Org for the 
latest information. 


Baseline Management in ISconf


In the ‘Congruence’ section, we discussed the 
concept of maintaining a fully descriptive baseline for 
congruent management. In the ‘Describing Disk State’ 
section, we discussed in general terms how this might 
be done. In this section, we will show how we do it in 
isconf. 


First, we install the base disk image, usually 
using vendor-supplied network installation tools. We 
discuss this process more in [bootstrap]. We might 
name this initial image ‘Block00’. Then we use the 
process we mentioned in the ‘ISconf Version 2’ sec- 
tion to apply changes to the machine over the course 
of its life. Each change we add updates our concept of 
what is the ‘baseline’ for that class of host. 


As we add changes, any new machine we build 
will need to run isconf longer on first boot, to add all 
of the accumulated changes to the Block00 image. 
After about forty minutes’ worth of changes have built 
up on top of the initial image, it helps to be able to 
build one more host that way, set the hostname/IP to 
‘baseline,’ cut a disk image of it, and declare that new 
image to be the new baseline. This infrequent snapshot 
or checkpoint not only reduces the build time of future 
hosts, but reduces the rebuild time and chance of error 
in rebuilding existing hosts — we always start new 
builds from the latest baseline image. 


In an isconf makefile, this whole process is 
reflected as in Listing 2. Note that whether we cut a 
new image and start the next install from that, or if we 
just pull an old machine off the shelf with a Block00 
image and plug it in, we’ll still end up with a Block20 
image with apache and a 2.2.12 kernel, due to the way 
the makefile prerequisites are chained. 


This example shows a simple, linear build of 
successive identical hosts with no “branching” for 
different host classes. Classes add slightly more com- 
plexity to the makefile. They require a top-level con- 
figuration file to define the classes and target them to 
the right hosts, and they require wrapper script code to 
read the config file. 


There is a little more complexity to deal with
things that should only happen at boot, and things that
can happen when cron runs the code every hour or so.
There are examples of all of this in the isconf-2i pack- 
age available from ISconf.Org. 


Cfengine Techniques 


Cfengine is likely the most popular purpose-built 
tool for automated systems administration today. The
cfengine language was optimized for dynamic prerequi-
site analysis rather than long, deterministic ordered sets.


While the cfengine language wasn’t specifically 
optimized for ordered behavior, it is possible to 
achieve this with extra work. It should be possible to 
greatly reduce the amount of effort involved, by using 
some tool to generate cfengine configuration files 
from makefile-like (or equivalent) input files. One 
good starting point might be Tobias Oetiker’s Tem-
plateTree II [oetiker].


Automatic generation of cfengine configuration 
files appears to be a near-requirement if the tool is to 
be used to maintain congruent infrastructures; the 
class and action-type structures tend to get relatively 
complex rather fast if congruent ordering, rather than 
convergence, is the goal. 


Other gains might be made from other features 
of cfengine; we have made progress experimenting 
with various helper modules, for instance. Another 
technique that we have put to good use is to imple- 
ment atomic changes using very small cfengine 
scripts, each equivalent to an ISconf makefile stanza. 
These scripts we then drive within a deterministically 
ordered framework. 


In the cfengine version 2 language there are new 
features, such as the FileExists() evaluated class func- 
tion, which may reduce the amount of code. So far, 
based on our experience over the last few years in trial 
attempts, it appears that a cfengine configuration file 
that does the same job as an ISconf makefile would 
still need anywhere from two to three times the num- 
ber of lines of code. We consider this an open and 
evolving effort though — check the cfengine.org and 
Infrastructures.Org websites for the latest information. 


Brown/Traugott Turing Equivalence 


If it should turn out that the basic logics of a 
machine designed for the numerical solution of 
differential equations coincide with the logics of 
a machine intended to make bills for a depart-
ment store, I would regard this as the most amaz-
ing coincidence that I have ever encountered.
— Howard Aiken, founder of Harvard’s Computer 
Science department and architect of the IBM/ 
Harvard Mark I. 


Turing equivalence in host management appears 
to be a new factor relative to the age of the computing
industry. The downsizing of mainframe installations
and distribution of their tasks to midrange and desktop 
machines by the early 1990’s exposed administrative 
challenges which have taken the better part of a 
decade for the systems administration community to 
understand, let alone deal with effectively. 


Older computing machinery relied more on dedi- 
cated hardware rather than software to perform many 
administrative tasks. Operating systems were limited 
in their ability to accept changes on the fly, often 
requiring recompilation for tasks as simple as adding 
terminals or changing the time zone. Until recently, 
the most popular consumer desktop operating system 
still required a reboot when changing IP address. 


In the interests of higher uptime, modern ver- 
sions of UNIX and Linux have eliminated most of 
these issues; there is very little software or configura- 
tion management that cannot be done with the 
machine “live.” We have evolved to a model that is
nearly equivalent to that of a Universal Turing 
Machine, with all of its benefits and pitfalls. To avoid 
this equivalence, we would need to go back to shutting 
operating systems down in order to administer them. 
Rather than go back, we should seek ways to go fur- 
ther forward; understanding Turing equivalence 
appears to be a good next step. 


This situation may soon become more critical, 
with the emergence of “soft hardware.” These systems
use Field-Programmable Gate Arrays to emulate
dedicated processor and peripheral hardware. Newer 
versions of these devices can be reprogrammed, while 
running, under control of the software hosted on the 
device itself [xilinx]. This will bring us the ability to 
modify, for instance, our own CPU, using high-level 
automated administration tools. Imagine not only acci- 
dentally unconfiguring your Ethernet interface, but 
deleting the circuitry itself. . . 


We have synthesized a thought experiment to dem- 
onstrate some of the implications of Turing equivalence 
in host management, based on our observations over the 
course of several years. The description we provide here 
is not as rigorous as the underlying theories, and much 
of it should be considered as still subject to proof. We do 
not consider ourselves theorists; it was surprising to find 
ourselves in this territory. The theories cited here pro- 
vided inspiration for the thought experiment, but the goal 
is practical management of UNIX and other machines. 
We welcome any and all future exploration, pro or con. 
See the ‘Conclusion and Critique’ section. 


## 01 Feb 97 - Block00 is initial disk install from vendor cd,
## with ntp etc. added later
Block00: ntp cvs lynx

## 15 Jul 98 - got tired of waiting for additions to Block00 to build,
## cut new baseline image, later add ssh etc.
Block10: Block00 ssh

## 17 Jan 99 - new baseline again, later add apache, rebuild kernel, etc.
Block20: Block10 apache kernel-2.2.12


Listing 2: Baseline management in an ISconf makefile.
In the following description of this thought experiment,
we will develop a model of system administration
starting at the level of the Turing machine. We will 
show how a modern self-administered machine is equiv- 
alent to a Turing machine with several tapes, which is in 
turn equivalent to a single-tape Turing machine. We will 
construct a Turing machine which is able to update its 
own program by retrieving new instructions from a net- 
work-accessible tape. We will develop the idea of con- 
figuration management for this simpler machine model, 
and show how problems such as circular dependencies 
and uncertainty about behavior arise naturally from the 
nature of computation. 


We will discuss how this Turing machine relates 
to a modern general-purpose computer running an auto- 
matic administration tool. We will introduce the impli- 
cations of the self-modifying code which this arrange- 
ment allows, and the limitations of inspection and test- 
ing in understanding the behavior of this machine. We 
will discuss how ordering of changes affects this behav- 
ior, and how deterministically ordered changes can 
make its behavior more deterministic. 


We will expand beyond single machines into the 
realm of distributed computing and management of 
multiple machines, and their associated inspection and 
testing costs. We will discuss how ordering of changes 
affects these costs, and how ordered change apparently 
provides the lowest cost for managing an enterprise 
infrastructure. 


Readers who are interested in applied rather than 
mathematical or theoretical arguments may want to 
review the previous sections or skip to the conclusion. 


A.1 — A Turing machine (Figure 5) reads bits from
an infinite tape, interprets them as data according to a 
hardwired program and rewrites portions of the tape 
based on what it finds. It continues this cycle until it 
reaches a completion state, at which time it halts [tur- 
ing]. 

A.2 — Because a Turing machine’s program is 
hardwired, it is common practice to say that the pro- 
gram describes or is the machine. A Turing machine’s 
program is stated in a descriptive language which we 
will call the machine language. Using this language, 
we describe the actions the machine should take when 
certain conditions are discovered. We will call each 
atom of description an instruction. An example 
instruction might say: 


If the current machine state is ‘s3’, and the tape 
cell at the machine’s current head position con- 
tains the letter ‘W’, then change to state ‘s7’, 
overwrite the ‘W’ with a ‘P’, and move the tape
one cell to the right. 


Each instruction is commonly represented as a 
quintuple; it contains the letter and current state to be 
matched, as well as the letter to be written, the tape 
movement command, and the new state. The instruc- 
tion we described above would look like: 
s3,W → s7,P,r


Note that a Turing machine’s language is in no way 
algorithmic; the order of quintuples in a program listing 
is unimportant; there are no branching, conditional, or 
loop statements in a Turing machine program. 


[Figure 5 diagram: a stored ruleset (s0,1:s3,0,R; s4,0:s7,1,L; s2,0:s2,1,L),
a state variable (current_state=s2), and a tape of bits
(1001011010010110001101001).]


Figure 5: Turing machine block diagram; the
machine reads and writes an infinite tape and
updates an internal state variable based on a hardwired
or stored ruleset.


A.3 — The content of a Turing tape is expressed 
in a language that we will call the input language. A 
Turing machine’s program is said to either accept or 
reject a given input language, if it halts at all. If our 
Turing machine halts in an accept state (which might
actually be a state named ‘accept’), then we know that
our program is able to process the data and produce a 
valid result — we have validated our input against our 
machine. If our Turing machine halts because there is 
no instruction that matches the current combination of 
state and cell content (§A.2), then we know that our 
program is unable to process this input, so we reject. If 
we never halt, then we cannot state a result, so we can- 
not validate the input or the machine. 
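
The three outcomes in §A.3 can be sketched in a few lines of Python. This is an illustration only, not part of the thought experiment itself: the function name `run`, the blank symbol `_`, the step budget `max_steps`, and both sample rulesets are assumptions introduced here. The budget does not decide whether a machine halts (see §A.7); it only lets the sketch give up.

```python
# Minimal sketch of a single-tape Turing machine (all names are hypothetical).
# Rules map (state, symbol) -> (new_state, symbol_to_write, head_move).
def run(rules, tape, state="s0", accept="accept", max_steps=1000):
    cells = dict(enumerate(tape))   # sparse tape; missing cells read as blank "_"
    head = 0
    for _ in range(max_steps):
        if state == accept:
            return "accept"
        key = (state, cells.get(head, "_"))
        if key not in rules:        # no matching instruction: halt and reject
            return "reject"
        state, symbol, move = rules[key]
        cells[head] = symbol
        head += 1 if move == "R" else -1
    return "unknown"                # budget exhausted; no result can be stated

# Accepts tapes made only of 1s; the order of the entries does not matter.
ones = {
    ("s0", "1"): ("s0", "1", "R"),
    ("s0", "_"): ("accept", "_", "R"),
}
# Moves right forever on a blank tape.
spin = {("s0", "_"): ("s0", "_", "R")}

print(run(ones, "111"))   # accept
print(run(ones, "101"))   # reject
print(run(spin, ""))      # unknown
```

The ruleset is a dictionary rather than a list to reflect the point in §A.2 that the order of quintuples in a program listing is unimportant.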


A.4 — A Universal Turing Machine (UTM) is able 
to emulate any arbitrary Turing machine. Think of this 
as running a Turing “virtual machine” (TVM) on top of 
a host UTM. A UTM’s machine language program 
(§A.2) is made up of instructions which are able to read
and execute the TVM’s machine language instructions.
The TVM’s machine language instructions are the
UTM’s input data, written on the input tape of the UTM 
alongside the TVM’s own input data (Figure 6). 


UTM Tape:  | TVM Program | TVM Data |


Figure 6: The tape of a Universal Turing Machine 
(UTM) stores the program and data of a hosted 
Turing Virtual Machine (TVM). 


Any multiple-tape Turing machine can be repre- 
sented by a single-tape Turing machine, so it is equally 
valid to think of our Universal Turing Machine as hav- 
ing two tapes; one for TVM program, and the other for 
TVM data. 
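
As a rough sketch of the arrangement in §A.4, the host below reads a TVM program and TVM data from one tape and interprets the program against the data. The textual encoding (quintuples separated by `;`, program and data separated by `|`), the function names, the start state `s0`, and the step budget are all assumptions made for this illustration.

```python
def parse_program(text):
    """Decode 'state,read:new_state,write,move' quintuples separated by ';'."""
    rules = {}
    for quintuple in filter(None, text.split(";")):
        match, action = quintuple.split(":")
        state, read = match.split(",")
        new_state, write, move = action.split(",")
        rules[(state, read)] = (new_state, write, move)
    return rules

def utm(tape, max_steps=1000):
    """Host machine: the TVM program is just data on the host's tape."""
    program_text, data = tape.split("|", 1)
    rules = parse_program(program_text)
    cells = dict(enumerate(data))
    head, state = 0, "s0"
    for _ in range(max_steps):
        key = (state, cells.get(head, "_"))   # "_" stands for a blank cell
        if key not in rules:                  # no matching instruction: halt
            break
        state, symbol, move = rules[key]
        cells[head] = symbol
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells))

# The same host runs two different TVMs, chosen only by tape content.
print(utm("s0,0:s0,1,R;s0,1:s0,0,R|0110"))   # 1001
print(utm("s0,1:s0,0,R|1101"))               # 0001
```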


A Universal Turing Machine appears to be a useful 
model for analyzing the theoretical behavior of a “real” 
general-purpose computer; basic computability theory 
seems to indicate that a UTM can solve any problem
that a general-purpose computer can solve [church]. 


A.5 — Further work by John von Neumann and 
others demonstrated one way that machines could be 
built which were equivalent in ability to Universal Tur- 
ing Machines, with the exception of the infinite tape size 
[vonneumann]. The von Neumann architecture is con- 
sidered to be a foundation of modern general purpose 
computers [godfrey]. 


A.6 — As in von Neumann’s “stored program” 
architecture, the TVM program and data are both stored 
as rewritable bits on the UTM tape (§A.4, Figure 6). 
This arrangement allows the TVM to change the 
machine language instructions which describe the TVM 
itself. If it does so, our TVM enjoys the advantages (and 
the pitfalls) of self-modifying code [nordin]. 


A.7 — There is no algorithm that a Turing 
machine can use to determine whether another specific 
Turing machine will halt for a given tape; this is 
known as the “halting problem.” In other words, Tur- 
ing machines can contain constructions which are dif- 
ficult to validate. This is not to say that every machine 
contains such constructions, but that an arbitrary
machine and tape chosen at random has some chance 
of containing one. 


A.8 — Note that, since a Turing machine is an 
imaginary construct [turing], our own brain, a pencil, 
and a piece of paper are (theoretically) sufficient to 
work through the tape, producing a result if there is 
one. In other words, we can inspect the code and 
determine what it would do. There may be tools and 
algorithms we can use to assist us in this [laiten- 
berger]. We are not guaranteed to reach a result though 
— in order for us to know that we have a valid machine 
and valid input, we must halt and reach an accept 
state. Inspection is generally considered to be a form 
of testing. 

Inspection has a cost (which we will use later):

C_{inspect}

This cost includes the manual labor required to
inspect the code, any machine time required for execution
of inspection tools, and the manual labor to examine
the tool results.


A.9 — There is no software testing algorithm that is 
guaranteed to ensure fully reliable program operation 
across all inputs — there appears to be no theoretical 
foundation for one [hamlet]. We suspect that some of 
the reasons for this may be related to the halting problem
(§A.7), Gödel’s incompleteness theorem [godel],
and some classes of computational intractability problems,
such as the Traveling Salesman problem and NP-completeness
[greenlaw, garey, brookshear, dewdney].


In practice, we can use multiple test runs to 
explore the input domain via a parameter study, equiv- 
alence partitioning [richardson], cyclomatic complex- 
ity analysis [mccabe], pseudo-random input, or other 
means. Using any or all of these methods, we may be 
able to build a confidence level for predictability of a
given program. Note that we can never know when 
testing is complete, and that testing only proves incor- 
rectness of a program, not correctness. 


Testing cost includes the manual labor required to 
design the test, any machine time required for execu- 
tion, and the manual labor needed to examine the test 
results: 

C_{test}

A.10 — For software testing to be meaningful, we 
must also ensure code coverage. Code coverage 
requirements are generally determined through some 
form of inspection (§A.8), with or without the aid of 
tools. Coverage information is only valid for a fixed 
program — even relatively minor code changes can 
affect code coverage information in unpredictable 
ways [elbaum]. We must repeat testing (§A.9) for 
every variation of program code. 


To ensure code coverage, testing includes the 
manual labor required to inspect the code, any 
machine time required for execution of the coverage 
tools and tests, and the manual labor needed to exam- 
ine the test results. Because testing for coverage 
includes code inspection, we know that testing is more 
expensive than inspection alone: 

C_{test} > C_{inspect}


A.11 — Once we have found a UTM tape that 
produces the result we desire, we can make many 
copies of that tape, and run them through many identi- 
cal Universal Turing Machines simultaneously. This 
will produce many simultaneous, identical results. 
This is not very interesting — what we really want to 
be able to do is hold the TVM program portion of the 
tape constant while changing the TVM data portion, 
then feed those differing tapes through identical 
machines. The latter arrangement can give us a form 
of distributed or parallel computing. 


A.12 — Altering the tapes (§A.11) presents a prob- 
lem though. We cannot in advance know whether these 
altered tapes will provide valid results, or even reach 
completion. We can exhaustively test the same program 
with a wide variety of sample inputs, validating each of 
these. This is fundamentally a time-consuming, pseudo- 
statistical process, due to the iterative validations nor- 
mally required. And it is not a complete solution (§A.9). 


A.13 — If we for some reason needed to solve 
slightly different problems with the distributed machines 
in §A.11, we may decide to use slightly different pro- 
grams in each machine, rather than add functionality to 
our original program. But using these unique pro- 
grams would greatly worsen our testing problem. We 
would not only need to validate across our range of 
input data (§A.9), but we would also need to repeat
the process for each program variant (§A.10). We 
know that testing many unique programs will be more 
expensive than testing one: 

C_{many} > C_{test}
A.14 — It is easy to imagine a Turing Machine
that is connected to a network, and which is able to 
use the net to fetch data from tapes stored remotely, 
under program control. This is simply a case of a mul- 
tiple-tape Turing machine, with one or more of the 
tapes at the other end of a network connection. 

A.15 — Building on §A.14, imagine a Turing Vir- 
tual Machine (TVM) running on top of a networked 
Universal Turing Machine (UTM) (§A.4). In this case, 
we might have three tapes; one for the TVM program, 
one for the TVM data, and a third for the remote net- 
work tape. It is easy to imagine a sequence of TVM 
operations which involve fetching a small amount of 
data from the remote tape, and storing it on the local 
program tape as additional and/or replacement TVM 
instructions (§A.6). We will name the old TVM 
instruction set A. The set of fetched instructions we 
will name B, and the resulting merger of the two we 
will name AB. Note that some of the instructions in B 
may have replaced some of those in A (Figure 7). 
Before the fetch, our TVM could be described (§A.2) 
as an A machine, after the fetch we have an AB 
machine — the TVM’s basic functionality has changed. 
It is no longer the same machine. 


Figure 7: Instruction set B partially overlays instruction
set A, creating set AB.





A.16 — Note that, if any of the instructions in set 
B replace any of those in set A (§A.15), then the
order of loading these sets is important. A TVM with 
the instruction set AB will be a different machine than 
one with set BA (Figure 8). 


Figure 8: Instruction set BA is created by loading B
before A; A partially overlays B this time.
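
The overlay behavior in §A.15 and §A.16 can be sketched with dictionaries, where loading a set replaces any instruction that shares a (state, symbol) key. The contents of `A` and `B` and the helper name `load` are made up for this illustration.

```python
# Instruction sets as (state, symbol) -> action tables; contents are made up.
A = {("s0", "0"): "a1", ("s1", "0"): "a2"}
B = {("s1", "0"): "b1", ("s2", "1"): "b2"}   # overlaps A at ("s1", "0")

def load(*instruction_sets):
    """Overlay instruction sets in load order; later sets replace earlier ones."""
    program = {}
    for instructions in instruction_sets:
        program.update(instructions)
    return program

AB = load(A, B)
BA = load(B, A)
print(AB[("s1", "0")])  # b1
print(BA[("s1", "0")])  # a2
print(AB == BA)         # False
```

If the two sets shared no keys, `AB` and `BA` would be equal in this simple model, which matches the condition stated in §A.16.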


A.17 — It is easy to imagine that the TVM in 
§A.15 could later execute an instruction from set B, 
which could in turn cause the machine to fetch another 
set of one or more instructions in a set we will call C, 
resulting in an ABC machine: 


Figure 9: If instructions from set AB load C, then 
ABC results. 


A.18 — After each fetch described in §A.17, the 
local program and data tapes will contain bits from (at 
least) three sources: the new instruction set just copied 
over the net, any old instructions still on tape, and the 
data still on tape from ongoing execution of all previous
instructions.


A.19 — The choice of next instruction to be fetched 
from the remote tape in §A.17 can be calculated by the 
currently available instructions on the local program 
tape, based on current tape content (§A.18). 


A.20 — The behavior of one or more new instruc- 
tions fetched in §A.17 can (and usually will) be influ- 
enced by other content on the local tapes (§A.18). With 
careful inspection and testing we can detect some of the 
ways content will affect instructions, but due to the 
indeterminate results of software testing (§A.9), we may 
never know if we found all of them. 


A.21 — Let us go back to our three TVM instruc- 
tion sets, A, B, and C (§A.17). These were loaded 
over the net and executed using the procedure 
described in §A.19. Assume we start with blank local 
program and data tapes. Assume our UTM is hard- 
wired to fetch set A if the local program tape is found 
to be blank. If we then run the TVM, A can collect 
data over the net and begin processing it. At some 
point later, A can cause set B to be loaded. Our local 
tapes will now contain the TVM data resulting from 
execution of A, and the new TVM machine instruc- 
tions AB. If the TVM later loads C, our program tape 
will contain ABC. 


A.22 — If the networked UTM machine con- 
structed in §A.21 always starts with the same (blank) 
local tape content, and the remote tape content does 
not change, then we can demonstrate that an A TVM 
will always evolve to an AB, then an ABC machine, 
before halting and producing a result. 


A.23 — Assuming the network-resident data 
never changes, we can rebuild our networked UTM at 
any time and restore it to any prior state by clearing 
the local tapes, resetting the machine state, and restart- 
ing execution with the load of A (§A.21). The 
machine will execute and produce the same intermedi- 
ate and final results as it did before, as in §A.22. 


A.24 — If the network-resident data does change, 
though, we may not be able to rebuild to an identical 
state. For example, if someone were to alter the net- 
work-resident master copy of the B instruction set 
after we last fetched it, then it may no longer produce 
the same intermediate results and may no longer fetch 
C (§A.19). We might instead halt at AB. 
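
A small simulation can illustrate §A.21 through §A.24. It is a deliberate simplification: each remote instruction set carries a fixed `next` pointer, whereas in the thought experiment the next fetch is computed from tape content (§A.19). The names `rebuild`, `remote`, `instructions`, and `next` are assumptions of this sketch.

```python
def rebuild(remote, first="A"):
    """Start from blank local tapes and replay fetches until none remain."""
    program, history = {}, []
    name = first
    while name is not None and name not in history:
        entry = remote[name]
        program.update(entry["instructions"])   # overlay as in §A.15
        history.append(name)
        name = entry["next"]
    return history, program

remote = {
    "A": {"instructions": {"k1": "a", "k2": "a"}, "next": "B"},
    "B": {"instructions": {"k2": "b"}, "next": "C"},
    "C": {"instructions": {"k3": "c"}, "next": None},
}
first_build = rebuild(remote)
second_build = rebuild(remote)
print(first_build[0])               # ['A', 'B', 'C']
print(first_build == second_build)  # True

# Someone edits the master copy of B so that it no longer fetches C.
remote["B"] = {"instructions": {"k2": "b2"}, "next": None}
print(rebuild(remote)[0])           # ['A', 'B']
```

With unchanged remote content the rebuild is repeatable; after the edit to B, the same procedure stops at AB.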


A.25 — Without careful (and possibly intractable) 
inspection (§A.8), we cannot prove in advance whether 
a BCA or CAB machine can produce the same result
as an ABC machine. It is possible that these, or other, 
variations might yield the same result. We can validate 
the result for a given input (§A.3). We would also need 
to do iterative testing (§A.12) to demonstrate that multi- 
ple inputs would produce the same result. Our cost of 
testing multiple or partially ordered sequences is greater 
than that required to test a single sequence: 

C_{partial} > C_{test}
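
To give a feel for why unordered change is more expensive to test, the sketch below applies three changes in every possible order and counts the distinct results. The change contents are invented for the example, and the model (a later change simply overwrites an earlier one) is far simpler than the tape-dependent behavior of §A.20, so it probably understates the problem.

```python
from itertools import permutations

def apply_in_order(order, changes):
    """Apply named changes in sequence; later settings overwrite earlier ones."""
    state = {}
    for name in order:
        state.update(changes[name])
    return state

# Hypothetical changes that partly overlap each other.
changes = {
    "A": {"kernel": "2.2.10", "sshd": "off"},
    "B": {"sshd": "on"},
    "C": {"kernel": "2.2.12"},
}

orders = list(permutations(changes))
outcomes = {order: apply_in_order(order, changes) for order in orders}
distinct = {tuple(sorted(state.items())) for state in outcomes.values()}
print(len(orders))    # 6
print(len(distinct))  # 4
```

A single fixed sequence leaves one resulting host to validate; allowing any order leaves four here, and the number of possible orders grows factorially with the number of changes.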
A.26 — If the behavior of any instruction from B in
§A.22 is in any way dependent on other content found
on tape (§A.18, §A.19, §A.20), then we can expect our
TVM to behave differently if we load B before loading 
A (§A.16). We cannot be certain that a UTM loaded 
with only a B instruction set will accept the input lan- 
guage, or even halt, until after we validate it (§A.3). 


A.27 — We might want to roll back from the load
or execution of a new instruction set. In order to do
this, we would need to return the local program and
data tapes to their previous content. For example, if
machine A executes and loads B, our instruction set
will now be AB. We might roll back by replacing our
tape with the A copy.


A.28 — Due to §A.26, it is not safe to try to roll
back the instruction set of machine AB to recreate
machine A by simply removing the B instructions. Some 
of B may have replaced A. The AB machine, while exe- 
cuting, may have even loaded C already (§A.21), in 
which case you won’t end up with A, but with AC. If 
the AB machine executed for any period of time, it is 
likely that the input data language now on the data tape 
is only acceptable to an AB machine — an A machine 
might reject it or fail to halt (§A.3). The only safe rollback
method seems to be something similar to §A.27.


A.29 — It is easy to imagine an automatic process 
which conducts a rollback. For example, in §A.27, 
machine AB itself might have the ability to clear its 
own tapes, reset the machine state, and restart execu- 
tion at the beginning of A, as in §A.23. 


A.30 — But the system described in §A.29 will 
loop infinitely. Each time A executes, it will load B, 
then AB will execute and reset the local tapes again. In 
practice, a human might detect and break this loop; to 
represent this interaction, we would need to add a fourth 
tape, representing the user detection and input data. 


A.31 — It is easy to imagine an automatic process 
which emulates a rollback while avoiding loops, with- 
out requiring the user input tape in §A.30. For exam- 
ple, instruction set C might contain the instructions 
from A that B overlaid. In other words, installing C 
will “roll back” B. Note that this is not a true rollback;
we never return to a tape state that is completely iden- 
tical to any previous state. Although this is an imper- 
fect solution, it is the best we seem to be able to do 
without human intervention. 


A.32 — The loop in §A.30 will cause our UTM to 
never reach completion — we will not halt, and cannot 
validate a result (§A.3). A method such as that in §A.31 can
prevent a rollback-induced loop, but is not a true roll- 
back — we never return to an earlier tape content. If 
these, or similar, methods are the only ones available to 
us, it appears that program-controlled tape changes 
must be monotonic — we cannot go back to a previous 
tape content under program control, otherwise we loop. 
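
The difference between the looping rollback of §A.29 and §A.30 and the roll-forward of §A.31 can be sketched as follows. The representation of a program as a tuple of loaded set names, the loop detector, and the contents of `A` and `B` are assumptions made for this illustration.

```python
def run_with_rollback(max_steps=20):
    """AB clears the tapes and restarts at A; A always loads B again."""
    program, seen = ("A",), set()
    for step in range(max_steps):
        if program in seen:                  # we have been here before
            return f"loop detected after {step} steps"
        seen.add(program)
        program = ("A", "B") if program == ("A",) else ("A",)
    return "budget exhausted"

print(run_with_rollback())   # loop detected after 2 steps

# Roll forward instead: C carries the A instructions that B overlaid.
A = {"k1": "a1", "k2": "a2"}
B = {"k2": "b2", "k3": "b3"}
C = {key: A[key] for key in B if key in A}

program, history = {}, []
for name, instructions in (("A", A), ("B", B), ("C", C)):
    program.update(instructions)
    history.append(name)

print(all(program[k] == v for k, v in A.items()))  # True
print(program == A)                                # False
print(history)                                     # ['A', 'B', 'C']
```

Every instruction from A is back in place, but the tape is not identical to the earlier A tape (the `k3` entry from B remains), and the history only moves forward.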


A.33 — Let us now look at a conventional appli- 
cation program, running as an ordinary user on a cor- 
rectly configured UNIX host. This program can be 
loaded from disk into memory and executed. At no
time is the program able to modify the “master” copy 
of itself on disk. An application program typically 
executes until it has output its results, at which time it 
either sleeps or halts. This application is equivalent to 
a fixed-program Turing machine (§A.1) in the follow- 
ing ways: Both can be validated for a given input 
(§A.3) to prove that they will produce results in a
finite time and that those results are correct. Both can
be tested over a range of inputs (§A.9) to build confidence
in their reliability. Neither can modify its own
executable instructions; in the UNIX machine they are
protected by filesystem permissions; in the Turing 
machine they are hardwired. (We stipulate that there 
are some ways in which §A.33 and §A.1 are not 
equivalent — a Turing machine has a theoretically infi- 
nite tape, for instance.) 


A.34 — We can say that the application program in 
§A.33 is running on top of an application virtual 
machine (AVM). If the application is written in Java, 
for example, the AVM consists of the Java Virtual 
Machine. In Perl, the AVM is the Perl bytecode VM. 
For C programs, the AVM is the kernel system call 
interface. Low-level code in shared libraries used by a 
C program uses the same syscall interface to interact 
with the hardware — shared libraries are part of the C 
AVM. A Perl program can load modules — these 
become part of the program’s AVM. A C or Perl pro- 
gram that uses the system() or exec() function calls 
relies on any executables called — these other executa- 
bles, then, are part of the C or Perl program’s AVM. 
Any executables called via exec() or system() in turn 
may require other executables, shared libraries, or other 
facilities. Many, if not most, of these components are 
dependent on one or more configuration files. These 
components all form an AVM dependency chain for any 
given application. Regardless of the size or shape of 
this chain, all application programs on a UNIX machine 
ultimately interact with the hardware and the outside 
world via the kernel syscall interface. 
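
One way to picture the AVM dependency chain is as a graph walk that collects everything an application can reach. The component names and edges below are hypothetical, chosen only to resemble the Perl example in §A.34, and `avm_chain` is a name invented for this sketch.

```python
def avm_chain(component, depends_on):
    """Collect everything reachable from `component` through dependency edges."""
    seen, stack = set(), [component]
    while stack:
        for dep in depends_on.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical dependency edges for a Perl application.
depends_on = {
    "report.pl": ["perl", "report.conf"],
    "perl": ["libc.so", "mailer.pm"],
    "mailer.pm": ["resolv.conf"],
    "libc.so": ["kernel syscall interface"],
}

chain = avm_chain("report.pl", depends_on)
print(len(chain))               # 6
print("resolv.conf" in chain)   # True
```

A change to any member of `chain` could alter the behavior of `report.pl`, even though the application's own code is untouched.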


A.35 — When we perform system administration 
actions as root on a running UNIX machine, we can 
use tools found on the local disk to cause the machine 
to change portions of that same disk. Those changes 
can include executables, configuration files, and the 
kernel itself. Changes can include the system adminis- 
tration tools themselves, and changed components and 
configuration files can influence the fundamental 
behavior and viability of those same executables in 
unforeseen ways, as in §A.10, as applied to changes in 
the AVM chain (§A.34). 


A.36 — A self-administered UNIX host runs an 
automatic systems administration tool (ASAT) period- 
ically and/or at boot. The ASAT is an application pro- 
gram (§A.33), but it runs as root rather than an ordi- 
nary user. While executing, the ASAT is able to mod- 
ify the “master” copy of itself on disk, as well as the 
kernel, shared libraries, filesystem layout, or any other 
portion of disk, as in §A.35. 
A.37 — The ASAT described in §A.36 is equivalent
to a Turing Virtual Machine (§A.4) in the ways
described in §A.33. In addition, a self-administered 
host running an ASAT is similar to a Universal Turing 
Machine in that the ASAT can modify its own pro- 
gram code (§A.6). 


A.38 — A self-administered UNIX host connected to 
a network is equivalent to a network-connected Universal 
Turing Machine (§A.14) in the following ways: The 
host’s ASAT (§A.36) can fetch and execute an arbitrary 
new program as in §A.15. The fetched program can fetch 
and execute another as in §A.17. Intermediate results can 
control which program is fetched next, as in §A.19. The 
behavior of each fetched program can be influenced by 
the results of previous programs, as in §A.20. 


A.39 — When we do administration via auto- 
mated means (§A.36), we rely on the executable por- 
tions of disk, controlled by their configuration files, to 
rewrite those same executables and configuration files 
(§A.35). Like the Universal Turing Machine in §A.32, 
changes made under program control must be assumed 
to be monotonic; non-reversible short of “resetting the 
tape state” by reformatting the disk.


A.40 — An ASAT (§A.36) runs in the context of 
the host kernel and configuration files, and depends 
either directly or indirectly on other executables and 
shared libraries on the host’s disk (§A.26).


The circular dependency of the ASAT AVM 
dependency tree (§A.34) forces us to assume that, even 
though we may not ever change the ASAT code itself, 
we can unintentionally change its behavior if we 
change other components of the operating system. This 
is similar to the indeterminacy described in §A.20. 


It is not enough for an ASAT designer to stati- 
cally link the ASAT binary and carefully design it for 
minimum dependencies. Other executables, their 
shared libraries, scripts, and configuration files might 
be required by ASAT configuration files written by a 
system administrator — the tool’s end user. 


When designing tools we cannot know whether 
the system administrator is aware of the AVM depen- 
dency tree (we certainly can’t expect them to have 
read this paper). We must assume that there will be 
circular dependencies, and we must assume that the 
tool designer will never know what these dependen- 
cies are. The tool must support some means of dealing 
with them by default. We’ve found over the last sev- 
eral years that a default paradigm of deterministic 
ordering will do this. 


A.41 — We cannot always keep all hosts identical; 
a more practical method, for instance, is to set up classes 
of machines, such as “workstation” and “mail server,” 
and keep the code within a class identical. This reduces 
the amount of coverage testing required (§A.10). This 
testing is similar to that described in §A.13. 


A.42 — The question of whether a particular 
piece of software is of sufficient quality for the job 
remains intractable (§A.9).
But in practice, in a mission-critical environment,
we still want to try to find most defects before
our users do. The only accurate way to do this is to
duplicate both program and input data, and validate
the combination (§A.3). In order for this validation to
be useful, the input data would need to be an exact
copy of real-world, production data, as would the program
code. Since we want to be able to not only validate
known real-world inputs but also test some possible
future inputs (§A.9), we expect to modify and disrupt
the data itself.


We cannot do this in production. Application 
developers and QA engineers tend to use test environ- 
ments to do this work. It appears to us that systems 
administrators should have the same sort of test facili- 
ties available for testing infrastructure changes, and 
should make good use of them. 


A.43 — Because the ASAT (§A.36) is itself a 
complex, critical application program, it needs to be 
tested using the procedure in §A.42. Because the 
ASAT can affect the operation of the UNIX kernel and 
all subsidiary processes, this testing usually will con- 
flict with ordinary application testing. Because the 
ASAT needs to be tested against every class of host 
(§A.41) to be used in production, this usually requires 
a different mix of hosts than that required for testing 
an ordinary application. 


A.44 — The considerations in §A.43 dictate a 
need for an infrastructure test environment for testing 
automated systems administration tools and techniques.
This environment needs to be separate from
production, and needs to be as identical as possible in 
terms of user data and host class mix. 


A.45 — Changes made to hosts in the test environ- 
ment (§A.44), once tested (§A.12), need to be trans- 
ferred to their production counterpart hosts. When 
doing so, the ordering precautions in §A.26 need to be 
observed. Over the last several years, we have found 
that if you observe these precautions, then you will see 
the benefits of repeatable results as shown in §A.22. In 
other words, if you always make the same changes first 
in test, then production, and you always make those 
changes in the same order on each host, then changes 
that worked in test will work in production. 
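
The ordering discipline in §A.45 can be sketched as an append-only journal that every host consumes in the same sequence. This is an illustration under assumptions, not a description of any particular tool: `JOURNAL`, `catch_up`, and the host histories are hypothetical.

```python
JOURNAL = ["A", "B", "C"]   # hypothetical master list of changes, append-only

def catch_up(host_history, journal=JOURNAL):
    """Return the changes a host still needs, in journal order."""
    applied = len(host_history)
    if host_history != journal[:applied]:
        raise ValueError("host history is not a prefix of the journal")
    return journal[applied:]

test_host = ["A", "B"]
prod_host = ["A"]
print(catch_up(test_host))  # ['C']
print(catch_up(prod_host))  # ['B', 'C']
```

Because each host's history must be a prefix of the journal, a production host can only arrive at states that a test host has already passed through.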


A.46 — Because an ASAT (§A.36) installed on 
many machines must be able to be updated without 
manual intervention, it is our standard practice to 
always have the tool update itself as well as its own 
configuration files and scripts. This allows the entire 
system state to progress through deterministic and 
repeatable phases, with the tool, its configuration files, 
and other possibly dependent components kept in sync 
with each other. 


By having the ASAT update itself, we know that 
we are purposely adding another circular dependency 
beyond that mentioned in §A.40. This adds to the 
urgency of the need for ordering constraints (§A.45). 
We suspect control loop theory applies here; this
circular dependency creates a potential feedback loop. 
We need to “break the loop” and prevent runaway 
behavior such as oscillation (replacing the same file 
over and over) or loop lockup (breaking the tool so 
that it cannot do anything anymore). Deterministically 
ordered changes seem to do the trick, acting as an 
effective damper. 


We stipulate that this is not standard practice for 
all ASAT users. But all tools must be updated at some 
point; there are always new features or bug fixes 
which need to be addressed. If the tool cannot support 
a clean and predictable update of its own code, then 
these very critical updates must be done “out of 
band.” This defeats the purpose of using an ASAT, 
and ruins any chance of reproducible change in an 
enterprise infrastructure. 


A.47 — Due to §A.25, if we allow the order of 
changes to be A, B, C on some hosts, and A, C, B on 
others, then we must test both versions of the resulting 
hosts (§A.13). We may have inadvertently created two
host classes (§A.41); due to the risk of unforeseen inter- 
actions we must also test both versions of hosts for all 
future changes as well, regardless of ordering of those 
future changes. The hosts may have diverged (see the 
‘Divergence’ section). 
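
A simple way to see the accidental host classes described in §A.47 is to group hosts by the exact sequence of changes they received. The host names, histories, and the function `host_classes` are invented for this sketch.

```python
from collections import defaultdict

def host_classes(histories):
    """Group hosts by the exact sequence of changes applied to them."""
    classes = defaultdict(list)
    for host, history in histories.items():
        classes[tuple(history)].append(host)
    return dict(classes)

histories = {
    "web1": ["A", "B", "C"],
    "web2": ["A", "B", "C"],
    "web3": ["A", "C", "B"],
}
classes = host_classes(histories)
print(len(classes))                 # 2
print(classes[("A", "C", "B")])     # ['web3']
```

Each distinct key is a separate class that would need its own testing for every future change.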


A.48 — It is tempting to ask “Why don’t we just
test changes in production, and roll back if they don’t
work?” This does not work unless you are able to take 
the time to restore from tape, as in §A.27. There’s also 
the user data to consider — if a change has been applied 
to a production machine, and the machine has run for 
any length of time, then the data may no longer be 
compatible with the earlier version of code (§A.28).
When using an ASAT in particular, it appears that
changes should be assumed to be monotonic (§A.39).


A.49 — It appears that editing, removing, or oth- 
erwise altering the master description of prior changes 
(§A.24) is harmful if those changes have already been
deployed to production machines. Editing previously- 
deployed changes is one cause of divergence. A better 
method is to always “roll forward” by adding new 
corrective changes, as in §A.31. 


A.50 — It is extremely tempting to try to create a 
declarative or descriptive language L that is able to 
overcome the ordering restrictions in §A.45 and 
§A.49. The appeal of this is obvious: “Here are the 
results I want, go make it so.” 


A tool that supports this language would work by 
sampling subsets of disk content, similar to the way 
our Turing machine samples individual tape cells 
(§A.1). The tool would read some instruction set P, 
written in language L by the sysadmin. While sam- 
pling disk content, the tool would keep track of some 
internal state S, similar to our Turing machine’s state 
(§A.2). Upon discovering a state and disk sample that
matched one of the instructions in P, the tool could
then change state, rewrite some part of the disk, and
look at some other part of the disk for something else
to do. Assuming a constant instruction set P, and a
fixed virtual machine in which to interpret P, this
would provide repeatable, validatable results (§A.3).


A.51 — Since the tool in §A.50 is an ASAT 
(§A.36), influenced by the AVM dependency tree 
(§A.34), it is equivalent to a Turing Virtual Machine as 
in §A.37. This means that it is subject to the ordering 
constraints of §A.45 and §A.47. If the host is net- 
worked, then the behavior shown in §A.15 through 
§A.20 will be evident.


A.52 — Due to §A.51, there appears to be no lan- 
guage, declarative or imperative, that is able to fully 
describe the desired content of the root-owned, managed 
portions of a disk while neglecting ordering and history. 
This is not a language problem: The behavior of the lan- 
guage interpreter or AVM (§A.34) itself is subject to 
current disk content in unforeseen ways (§A.35). 


We stipulate that disk content can be completely 
described in any language by simply stating the com- 
plete contents of the disk. Cloning, discussed in ‘A 
Prediction,’ is an applied example of this case. This 
class of change seems to be free of the circular depen- 
dencies of an AVM; the new disk image is usually 
applied when running from an NFS or ramdisk root 
partition, not while modifying a live machine. 


A.53 — A tool constructed as in §A.50 is useful
for a very well-defined purpose: when hosts have
diverged (§A.47) beyond any ability to keep track of
what changes have already been made. At this point,
you have two choices: rebuild the hosts from scratch,
using a tool that tracks lifetime ordering, or use a
convergence tool to gain some control over them.
Cfengine is one such tool.


A.54 — It is tempting to ask “Does every change
really need to be strictly sequenced? Aren’t some 
changes orthogonal?” By orthogonal we mean that 
the subsystems affected by the changes are fully inde- 
pendent, non-overlapping, cause no conflict, and have 
no interaction with each other, and therefore are not subject
to ordering concerns. 


While it is true that some changes will always be 
orthogonal, we cannot easily prove orthogonality in 
advance. It might appear that some changes are “obviously
unrelated” and therefore not subject to sequencing issues.
The problem is, who decides? We stipulate
that talent and experience are useful here, for good 
reason: it turns out that orthogonality decisions are 
subject to the same pitfalls as software testing. 


For example, inspection (§A.8) and testing
(§A.9) can help detect changes which are not orthogonal.
Code coverage information (§A.10) can be used
to ensure the validity of the testing itself.


But in the end, none of these provide assurance 
that any two changes are orthogonal, and like other 
testing, we cannot know when we have tested or 
inspected for orthogonality enough. As in our Perl
example in the ‘Ordered Thinking’ section, inspection 
of high-level code alone is not enough either; we can- 
not assume that the underlying layers are correct. 


Due to this lack of assurance, the cost of predict- 
ing orthogonality needs to accrue the potential cost of 
any errors that result from a faulty prediction. This 
error cost includes lost revenue, labor required for 
recovery, and loss of goodwill. We may be able to 
reduce this error cost, but it cannot be zero — a zero 
cost implies that we never make mistakes when ana- 
lyzing orthogonality. Because the cost of prediction 
includes this error cost as well as the cost of testing, 
we know that prediction of orthogonality is more 
expensive than either the testing or error cost alone: 

$C_{predict} > C_{error}$
$C_{predict} > C_{test}$


A.55 — As a crude negative proof, let us take a 
look at what would happen if we were to allow the 
order of changes to be totally unsequenced on a pro- 
duction host. First, if we were to do this, it is apparent 
that some sequences would not work at all, and would 
probably damage the host (§A.26). We would need to 
have a way of preventing them from executing, proba- 
bly by using some sort of exclusion list. In order to 
discover the full list of bad sequences, we would need 
to test and/or inspect each possible sequence. 


This is an intractable problem: the number of 
possible orderings of M changes is M!. If each 
build/test cycle takes an hour, then any number of 
changes beyond seven or eight becomes impractical — 
testing all combinations of eight changes would require 
4.6 years. In practice, we see change sets much larger 
than this; the ISconf version 2i makefile for building 
HACMP clusters, for instance, has sequences as long as 
121 operations; that’s 121!/24/365, or about 9.24*10^196 years.
It is easier to avoid unsequenced changes. 
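
The arithmetic behind these figures is easy to check. The sketch below assumes, as the text does, one hour per build/test cycle and 24*365 hours per year; the function name is ours.

```python
import math

HOURS_PER_YEAR = 24 * 365


def years_to_test_all_orderings(num_changes: int, hours_per_cycle: int = 1) -> float:
    """Years needed to build and test every ordering of num_changes changes."""
    orderings = math.factorial(num_changes)  # M! possible sequences
    return orderings * hours_per_cycle / HOURS_PER_YEAR


print(f"{years_to_test_all_orderings(8):.1f} years for 8 changes")
print(f"{years_to_test_all_orderings(121):.2e} years for 121 changes")
```

Eight changes already cost 40,320 build/test cycles; each additional change multiplies the total by the new sequence length.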


The cost of testing and inspection required to 
enable randomized sequencing appears to be greater 
than the cost of testing a subset of all sequences 
(§A.25), and greater than the testing, inspection, and 
accrued error of predicting orthogonality (§A.54): 

$C_{random} > C_{predict} > C_{partial}$

A.56 — As a self-administering machine changes 
its disk contents, it may change its ability to change its 
disk contents. A change directive that works now may 
not work in the same way on the same machine in the 
future and vice versa (§A.26). There appears to be a 
need to constrain the order of change directives in 
order to obtain predictable behavior. 


A.57 — In contrast to §A.52, a language that sup- 
ports execution of an ordered set of changes appears to 
satisfy §A.56, and appears to have the ability to fully 
describe any arbitrary disk content, as in ‘Describing 
Disk State.’ 


A.58 — In practice, sysadmins tend to make 
changes to UNIX hosts as they discover the need for 
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them; in response to user request, security concern, or 
bug fix. If the goal is minimum work for maximum 
reliability, then it would appear that the “ideal” 
sequence is the one which is first known to work — the 
sequence in which the changes were created and 
tested. This sequence carries the least testing cost. It 
carries a lower risk than a sequence which has been 
partially tested or not tested at all. 


The costs in §A.8, §A.9, §A.25, §A.54, and 
§A.55 are related to each other as shown in Figure 10. 
This leads us to these conclusions: 

• Validating, inspecting, testing, and deploying a
single ordered sequence (C;,,,) appears to be the 
least-cost host change management technique. 

• Adequate testing of partially-ordered sequences
($C_{partial}$) is more expensive.

• Predicting orthogonality between partial
sequences ($C_{predict}$) is yet more expensive.

• The testing required to enable random change
sequences ($C_{random}$) is more expensive than any
other testing, due to the N! combinatorial
explosions involved.












[Figure 10 labels, in the order they survive: enable random sequences; predict orthogonality (... error); test partially ordered; test ordered sequence.]

Figure 10: Relationship between costs of various ordering
techniques; larger set size means higher cost.








A.59 — The behavioral attributes of a complex 
host seem to be effectively infinite over all possible 
inputs, and therefore difficult to fully quantify (§A.9). 
The disk size is finite, so we can completely describe 
hosts in terms of disk content, but we cannot com- 
pletely describe hosts in terms of behavior. We can 
easily test all disk content, but we do not seem to be 
able to test all possible behavior. 


This point has important implications for the 
design of management tools — behavior seems to be a 
peripheral issue, while disk content seems to play a 
more central role. It would seem that tools which test 
only for behavior will always be convergent at best. 
Tools which test for disk content have the potential to 
be congruent, but only if they are able to describe the 
entire disk state. One way to describe the entire disk is 
to support an initial disk state description followed by 
ordered changes, as in ‘Describing Disk State.’ 


A.60 — There appears to be a general statement 
we can make about software systems that run “on top
of” others in a “virtual machine” or other software- 
constructed execution environment (§A.34): 


If any virtual machine instruction has the ability 
to alter the virtual machine instruction set, then 
different instruction execution orders can produce
different instruction sets. Order of execution
of these instructions is critical in determining
the future instruction set of the machine.
Faulty order has the potential to remove the abil- 
ity for the machine to update the instruction set 
or to function at all. 


This applies to any application, automatic administration
tool (§A.36), or shared library code executed
as root on a UNIX machine (it also applies to other 
cases on other operating systems). These all interact 
with hardware and the outside world via the operating 
system kernel, and have the ability to change that 
same kernel as well as higher-level elements of their 
“virtual machine.” This statement appears to be independent
of the language of the virtual machine instruction
set (§A.52).
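
A toy model may make the statement in §A.60 concrete. This is only a sketch under our own assumptions: the class and the two instruction names are invented for illustration and do not correspond to any real tool. One instruction adds a new instruction to the machine, the other removes the installer.

```python
from typing import Callable, Dict, List


class ToyVM:
    """A toy 'virtual machine' whose instructions may edit its own instruction set."""

    def __init__(self) -> None:
        self.instructions: Dict[str, Callable[["ToyVM"], None]] = {
            "install_feature": install_feature,
            "remove_installer": remove_installer,
        }

    def run(self, program: List[str]) -> List[str]:
        """Execute instruction names in order; return the names that no longer exist."""
        failed = []
        for name in program:
            op = self.instructions.get(name)
            if op is None:
                failed.append(name)
            else:
                op(self)
        return failed


def install_feature(vm: ToyVM) -> None:
    # Adds a new instruction to the machine.
    vm.instructions["feature"] = lambda _vm: None


def remove_installer(vm: ToyVM) -> None:
    # Removes the machine's ability to install anything further.
    vm.instructions.pop("install_feature", None)


vm1, vm2 = ToyVM(), ToyVM()
failed1 = vm1.run(["install_feature", "remove_installer"])
failed2 = vm2.run(["remove_installer", "install_feature"])
print(sorted(vm1.instructions), failed1)
print(sorted(vm2.instructions), failed2)
```

The same two instructions, run in the opposite order, leave the second machine without the new feature and without any means of installing it.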


Conclusion and Critique 


One interesting result of automated systems 
administration efforts might be that, like the term 
‘computer,’ the term ‘system administrator’ may some- 
day evolve to mean a piece of technology rather than a 
chained human. 


Sometime in the last few years, we began to sus- 
pect that deterministic ordering of host changes may 
be the airfoil of automated systems administration. 
Many other tool designers make use of algorithms that 
specifically avoid any ordering constraint; we accepted 
ordering as an axiom. 


With this constraint in place, we have built and 
maintained many thousands of hosts, in many mis- 
sion-critical production infrastructures worldwide, 
with excellent results. These results included high reli- 
ability and security, low cost of ownership, rapid 
deployments and changes, easy turnover, and excellent 
longevity — after several years, some of our first 
infrastructures are still running and are actively main- 
tained by people we’ve never met, still using the same 
toolset. Our attempts to duplicate these results while 
neglecting ordering have not met these same standards 
as well as we would like. 


In this paper, our first attempt at explaining a theo- 
retical reason why these results might be expected, we 
have not “proven” the connection between ordering 
practice and theory in any mathematical sense. We hope 
we have, however, been able to provide a thought exper- 
iment which will help guide future research. Based on 
this thought experiment, it seems that more in-depth the- 
oretical models may be able to support our practical 
results.


This work seems to imply that, if hosts are Tur- 
ing equivalent (with the possible exception of tape 
size) and if an automated administration tool is Turing 
equivalent in its use of language, then there may be 
certain self-referential behaviors which we might want 
to either avoid or plan for. This in turn would imply 
that either order of changes is important, or the host or
method of administration needs to be constrained to 
less than Turing equivalence in order to make order 
unimportant. The validity of this claim is still an open 
question. In our deployments we have decided to err 
on the side of ordering. 


On tape size: one addition to our “thought exper- 
iment” might be a stipulation that a network-con- 
nected host may in fact be fully equivalent to a Uni- 
versal Turing Machine, including infinite tape size, if 
the network is the Internet. This is possibly true, due 
to the fact that the host’s own network interface card 
will always have a lower bandwidth than the growth 
rate of the Internet itself — the host cannot ever reach 
“the end of the tape.” We have not explored the implications
or validity of this claim. If true, this claim may
be especially interesting in light of the recent trend of 
package management tools which are able to self- 
select, download, and install packages from arbitrary 
servers elsewhere on the Internet. 


Synthesizing a theoretical basis for why “order 
matters” has turned out to be surprisingly difficult. The 
concepts involve the circular dependency chain men- 
tioned in the section on ‘Ordered Thinking,’ the depen- 
dency trees which conventional package management 
schemes support, as well as the interactions between 
these and more granular changes, such as patches and 
configuration file edits. Space and accessibility con- 
cerns precluded us from accurately providing rigorous 
proofs for the points made in the ‘Turing Equivalence’ 
section. Rather than do so, we have tried to express 
these points as hypotheses, and have provided some 
pointers to some of the foundation theories that we 
believe to be relevant. We encourage others to attempt 
to refute or support these assertions. 














Figure 11: Thread structure of Turing Equivalence 
assertions. 
You are in a maze of twisty little passages, all
alike. — Will Crowther’s “Adventure” 


There may be useful vulnerabilities or benefits 
hidden in the structure of the ‘Turing Equivalence’ sec- 
tion. Even after the many months we have spent poring 
over it, it is still certainly more complex than it needs to 
be, with many intertwined threads and long chains of 
assumptions (Figure 11). One reason for this complex- 
ity was our desire to avoid forward references within 
that section; we didn’t want to inadvertently base any 
point on circular logic. A much more readable text 
could likely be produced by reworking these threads 
into a single linear order, though that would likely 
require adding the forward references back in. 


For further theoretical study, we recommend: 
• Gödel Numbers
• Gödel’s Incompleteness Theorem
• Chomsky’s Hierarchy
• Diagonalization
• The halting problem
• NP completeness and the Traveling Salesman Problem
• Theory of ordered sets
• Closed-loop control theory
Starting points for most of these can be found in 
[greenlaw, garey, brookshear, dewdney]. 
ABSTRACT


Server systems invariably write detailed activity logs whose value is widespread, whether 
measuring marketing campaigns, detecting operational trends or catching fraud or intrusion. 
Unfortunately, production volumes overwhelm the capacity and manageability of traditional data 
management systems, such as relational databases. Just loading 1,000,000 records is a big deal 
today, to say nothing of the billions of records often seen in high-end network security, network 
operations and web applications. Since the magnitude of the problem is scaling with increases in 
CPU and networking speeds, it doesn’t help to wait for faster systems to catch up. 


This paper discusses the issues involving large-scale log management, and describes a new 
type of data management platform called a Log Management System, which is specifically 
designed to cost effectively compress, manage and analyze log records in their original, 
unsummarized form. To quote Tom Lehrer, “I have a modest example here” — in this case 
commercial software that can store and process logs in parallel across a cluster of Linux-based 
PCs using a combination of SQL and perl. The paper concludes with some lessons we learned in
building the system.


What Is a Log and Why There Is a Problem 


Logs are append-only, timestamped records rep- 
resenting some event that occurred in some computer 
or network device. Once upon a time, logs were used 
by programmers and system administrators to figure 
out “what’s going on” inside systems, and weren’t of
much value to business people. That’s all changed 
with the rise of internet-based communication, online 
shopping, online exchanges, and legal requirements to 
archive traffic and to protect privacy (a.k.a. avoid get- 
ting hacked). Unfortunately, tools to manage log data 
haven’t kept up with the rise in traffic, and people 
have reverted to building custom tools. This paper 
describes a general-purpose solution. 


As a motivating example, one company we’ll call 
ABC Corp. was using a content delivery network 
(CDN) to “accelerate” (cache) the results of image 
requests from their image repository, which stored over 
1,000,000 images. Unfortunately, CDNs are expensive 
and actually slow down the delivery performance for 
images that aren’t frequently accessed. In ABC’s appli- 
cation, the access patterns to the images were tied to 
promotions and other unpredictable criteria. To optimize 
their use of the CDN, they implemented a log manage- 
ment system (LMS) to capture traffic to the image 
repository and dynamically choose whether to use the 
CDN based on the frequency of access. In addition to 
accelerating their content, the system saved ABC 
$10,000 per month in network bandwidth costs. 
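
The decision ABC made can be sketched in a few lines. This is not ABC's implementation; the function name, the sample URLs and the threshold of three hits are all assumptions chosen for illustration. The idea is simply to count recent requests per image from the logs and send only frequently requested images through the CDN.

```python
from collections import Counter
from typing import Iterable, Set


def images_worth_caching(requested_urls: Iterable[str], min_hits: int) -> Set[str]:
    """Return the image URLs requested at least min_hits times in the log window."""
    hits = Counter(requested_urls)
    return {url for url, count in hits.items() if count >= min_hits}


# Hypothetical log window: one URL per logged image request.
recent_requests = ["/img/a.gif", "/img/b.gif", "/img/a.gif", "/img/a.gif", "/img/c.gif"]
cdn_set = images_worth_caching(recent_requests, min_hits=3)
print(cdn_set)
```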

Broadly, companies like KeyNote, NetRatings 
(AC Nielsen), DoubleClick, VeriSign, Google and 
Inktomi provide various hosted internet services, and 
need to report on their usage (for marketing), 
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performance (for engineering and 24x7 operations) 
and conformance to service level agreements (SLAs, 
also for operations). Network security applications are 
drowning in log data, coming from system logs, 
routers, firewalls and intrusion detection systems 
(IDSs). 


There are many reasons that traditional data 
management solutions cannot effectively manage log 
data, but the first one that users typically experience is 
in the sheer volume of log data. For example, here are 
some online applications and the volumes they gener- 
ate: 

• Loudcloud: over seven GB per day of security-related syslogs.
• iPLX: over 20 GB per day for hosting photos on eBay.
• topica: over 60 GB per day logging email traffic.
• TerraLycos: 75 GB per day of weblogs from 12 major web portals.
• shockwave.com: over 24 GB per day of logs about people watching
  online films and playing online games.
• DoubleClick: over 200 GB/day of records about people seeing online ads.


This paper describes a Log Management System 
(LMS) which allows network admins to get their arms 
around their logs without breaking their backs. The 
author envisions never writing another one-off custom 
log analyzer, like he had to do for Inktomi (hotbot), 
bamboo.com (virtual tours) and Internet Pictures 
(eBay Picture Services). 


Previous Solutions and Unresolved Problems 


Until recently, most companies discarded opera- 
tional logs, storing only logs of their financial 
transactions. Unfortunately, many companies no
longer have this option: you can’t figure out why 
online shoppers are abandoning their shopping carts 
before completing their purchases unless you look at 
the page views that didn’t lead to sales. 


One solution people attempt is to sample the data, 
then run reports against the samples. Similarly, people 
sometimes run aggregating summaries first, then report 
against the summaries. Both sampling and summaries 
suffer from the following issues. First, you have to plan 
everything in advance — you can’t decide later what 
queries you want, since you’ve discarded the original 
data. Buggy sampling/summarization code results in 
corrupted results forever, another flavor of the 
“changed your mind” problem. Secondly, sampling is 
dangerous: if you don’t sample across the correct 
dimension, you get the wrong answer, which can lead to 
bad business decisions. Lastly, a sample can’t tell you 
whether something didn’t occur. For example, samples
and summaries are not useful for security applications
or when logs are stored for regulatory reasons.


Another solution is to build an LMS using off- 
the-shelf components, such as relational databases. 
Unfortunately, you still have to deal with parsing 
problems, sequence/session analysis and providing 
tools for non-experts to use, so the LMS author isn’t 
stuck writing every query. All of these solutions also 
need to scale up, i.e., they need to parallelize and sup- 
port paging to disk when running low on RAM. For 
example, parallel sequence analysis is notoriously 
tricky. Relational databases solve some of these scal- 
ing issues, to a point. Unfortunately, even the fastest 
databases can’t load records as fast as enterprise appli- 
cations generate them, much less provide the head- 
room to reload data in case something goes wrong in a 
load. When it comes time to run queries, they depend 
on “indexes” (e.g., B-trees) which accelerate some
queries and not others, resulting in “cliffs” where per- 
formance suddenly degrades for no apparent reason. 
For example, a regular expression search in a database 
cannot take advantage of an index. Finally, databases 
are outrageously expensive in hardware, software,
and the people needed to customize and tune them.


What Does It Mean to Solve the Problem? 


Logs are generated, parsed then indexed and 
compressed — this then allows them to be queried and 
stored, respectively. As all sysadmins know, manage- 
ment tasks are critical, including reorganizing logs 
(e.g., for performance) and retiring them when no 
longer useful. See Figure 1 for a picture.


It is worth noting that it is usually impractical to 
keep logs “at the edge of the network,” i.e., where
they were generated. First, enterprises often require 
centralized reports, which becomes difficult when logs 
are separated by slow, unreliable networks and fire- 
walls — or when the log-generating machines lack the 
storage or CPU power to effectively answer complex
queries. Finally, managing widely distributed systems 
can be a nightmare, due to heterogeneity of hardware, 
operating systems, tools, access, etc. 





[Figure 1 labels: farms of log-generating computers, each sending its logs to an LMS; local or wide-area network.]

Figure 1: Workflow of log production and analysis.





It is also worth noting that scalability affects 
everything you do with logs: not only are excellent 
compressing and indexing basic requirements, but also 
parallel execution. Intuitively, if you have 5,000 devices 
generating logs, you probably need more than one collect- 
ing the results. Practically speaking, the mainframe- 
class system capable of keeping up with a large appli- 
cation’s traffic costs a ridiculous amount of money. 


The Vision 


My vision is for a single piece of software to 
replace the five-minute perl hacks with solid infrastruc- 
ture for handling log data. In doing so, it is key to cre- 
ate a community which shares scripts to parse various 
log formats, create various reports, etc. Ideally, there 
would be a dedicated group of software engineers with 
the time and talent to invest in features like parallel data 
management tools, concurrency control so you can load 
and query data at the same time and connectors to 
front-end tools like MRTG and CrystalReports. 


So we built one. It’s in use at places like topica, 
where they track over one billion emails a month. 
Running the LMS, five PCs running RedHat 7.1 are 
able to load more than 20,000 records per second (rps) 
of weblogs (200-600 bytes/record, depending on the 
site), then query them at rates of over 250,000 rps. 
We’ve handled qmail logs, apache and IIS weblogs, 
syslogs of various kinds, tuxedo logs and numerous 
custom logs. Yes, the LMS is a commercial package — 
it cost us several million dollars to build it. 


Design Decisions for a Scalable LMS 


Architecturally, the Addamark LMS looks like a 
webserver, only it listens for requests to a reserved URI 
(/cgi-app/xmlrpc/execute). If the request contains XML,
the server parses the request (including data, e.g., for 
loading) and returns results, errors and/or progress indicators.
Behind the scenes, when you connect to a
server, it parses up your request, and farms it out across
the cluster. Each host then parses its piece, matches 
table and column names against directories and files in 
its local filesystem (or its NFS-mounted partitions), and 
processes its chunk of the request. 


There is a single config file listing the members 
of each cluster (cluster.xml) and a single config file 
describing the local config options for the given host 
(athttpd.conf). The local config file, for example, 
describes the paths to the data, port to listen on, etc. 
The LMS starts up using an /etc/init.d script. Finally, 
like apache, you can have multiple LMS installations 
per machine, and as long as they have separate paths, 
they can run concurrently. In fact, we even conspired 
to make the lockfiles compatible and the data file 
(backward) compatible, so two installations can share
the same datastore directories, thereby allowing
“rolling upgrades,” which is critical for 24x7 opera- 
tions, and also critical for avoiding the nightmare of 
reloading terabytes of data that were loaded over the 
course of months or years. The diagram in Figure 2 
shows what a datastore file-tree might look like. Fig- 
ure 3 depicts an architecture diagram for the LMS 
software. As you can see, we tried to avoid reinventing
the wheel: even the parallel SQL engine started
out as Postgres. The network protocol
is XML over HTTP, which makes it quite easy to build
new clients, including test harnesses.
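
The datastore layout in Figure 2 (a leaf directory per segment, holding one compressed file per column) can be approximated with the standard library. The sketch below is not the Addamark on-disk format: the directory names, the newline-delimited encoding and the helper names are our assumptions. It only shows why a query that touches one column needs to decompress one file.

```python
import gzip
import os
import tempfile
from typing import Dict, List, Tuple


def write_segment(segment_dir: str, columns: Dict[str, List[str]]) -> None:
    """Write one gzip-compressed file per column into a leaf directory."""
    os.makedirs(segment_dir, exist_ok=True)
    for name, values in columns.items():
        path = os.path.join(segment_dir, name + ".gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            # One value per line; values containing newlines are not handled here.
            f.write("\n".join(values) + "\n")


def read_column(segment_dir: str, name: str) -> List[str]:
    """Read back a single column without touching the other columns."""
    path = os.path.join(segment_dir, name + ".gz")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()


def demo() -> Tuple[List[str], List[str]]:
    # Hypothetical segment: records ordered by timestamp, split by column.
    columns = {
        "ts": ["2002-01-29 23:44:37", "2002-01-29 23:45:06"],
        "url": ["/", "/index.html"],
        "respsize": ["7121", "17252"],
    }
    with tempfile.TemporaryDirectory() as root:
        segment = os.path.join(root, "weblog", "2002-01-29")
        write_segment(segment, columns)
        return sorted(os.listdir(segment)), read_column(segment, "url")


files, urls = demo()
print(files, urls)
```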


Loading 


[Figure 2 labels: older timestamps; recent timestamps; leaf subtree (leaf directory containing one "segment" of compressed data); one compressed file per column (ts.gz, url.gz, respsize.gz), with multiple records, ordered by timestamp.]

Figure 2: Sample database file tree.

[Figure 3 labels: Addamark Perlcode; 3rd Party PerlLibs.]

Figure 3: LMS software architecture diagram.

Requirements. An LMS should handle any type
of logs, not just “standard” ones. Partly, this is
because the apps which most need an LMS are exactly
the type who are likely to have large volumes of logs
and to customize their log formats to save time or
space, or to add special fields they find useful. In our
experience, logs tend to exhibit these parsing issues:

• Quoting and escapification. Since logs are usually
  text data separated by some character, you need some
  way to handle the case when the separator character
  is present in a given field.
• Binary data and internationalization. These days, every
  management tool needs to consider these issues.
• Reverse DNS. Logs [text illegible in source] fields,
  where you might want to query on the DNS name they
  represent. This reverse DNS operation can be very
  expensive, especially when IP addresses fail to resolve.
• Variant and XML records, name=value pairs. Logs with
  variant records have different “formats” on each line,
  usually determined by some field in the first N fields.
  This is typically found in custom application logs,
  rather than logs from commercial devices. However,
  XML log records are becoming more popular. Sometimes,
  a field will contain name=value pairs, e.g., the GET
  method arguments in a weblog’s URL field.
• Third party algorithms. It is sometimes the case that
  you need (or want) to reuse some third party code to
  help parse a log record. For example, if one of the
  fields is encrypted, you almost certainly want to use
  a third party library for decrypting it.
• Rejected record handling. It is an unfortunate reality
  that log data almost always contains some number of
  bogus records that fail to parse. It is therefore
  helpful to have good support for handling rejected
  records when debugging parsing scripts. Likewise, in
  cases when “every byte counts” (e.g., legal disputes),
  you will want to ensure that rejected records aren’t lost.
• Excluding and double-loading columns. Sometimes, users
  will want to discard a column (heresy!), e.g., to save
  space. Assuming you [text illegible in source] a user
  will want to “double-load” a column for faster query
  performance, e.g., load both an IP address as well as
  its DNS name.

Design Decisions. For performance, we parse
logs in parallel across the cluster, using a regular


## some example records (for compatibility, we also support hash-comments)
# 199.166.228.8 - - [29/Jan/2002:23:44:37 -0800] "GET / HTTP/1.0" 200 7121
# "check_http/1.32.2.6 (netsaint-plugins 1.2.9-4)" 0
# 212.35.97.195 - - [29/Jan/2002:23:45:06 -0800] "GET
# /images/1lms_overview_pagel.gif HTTP/1.1" 200 17252
# "http://paulboutin.weblogger.com/2002/01/28" "Mozilla/4.0
# (compatible; MSIE 6.0; Windows NT 5.0; Q312461)" 1
# 62.243.230.170 - - [19/Feb/2002:17:29:43 -0800] "POST /cgi-bin/form.pl
# HTTP/1.1" 302 5 "http://addamark.com/product/requestform.html"
# "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" 1

## vvvvvv this is the regexp used to parse up the records vvvvvvvvv
[regular expression illegible in source]
ClientIP:VARCHAR, unused1:VARCHAR, unused2:VARCHAR, tsStr:VARCHAR,
Method:VARCHAR, Url:VARCHAR, HttpVers:VARCHAR, RespCode:INT32,
RespSize:INT32, Referrer:VARCHAR, UserAgent:VARCHAR, RespTime:VARCHAR
[line illegible in source]
## these are the assigned "parse field" names and datatypes

-- this is the SQL statement used to transform the parse fields
-- VVVVVVVVV (from the "stdin" table) into the final table records VVVVVV
SELECT _strptime(tsStr, "%d/%b/%Y:%H:%M:%S %Z") as ts,
       ClientIP,
       _rev_dns(ClientIP) as ClientDNS, -- perform a reverse DNS lookup
       Method,
       Url,
       HttpVers,
       RespCode,
       RespSize,
       Referrer,
       UserAgent,
       _int32(RespTime) as RespTime -- can also parse strings as numbers here
FROM stdin;


Display 1: Example PTL script for loading an NCSA weblog.
expression designed to match single-line records. To
handle multi-line records, we pre-process the data
before loading, to force records onto one (virtual) line.
To provide the flexibility needed, we provide a declarative
language based on SQL. Roughly, a load “statement”
is a SELECT from a table whose columns are
the parse fields, and whose output is used to load the
data. To transform a field (e.g., using builtin or third-party
functions), simply place a SQL expression in the
SELECT statement (the SQL “targets,” as they’re
called); to exclude a column, just don’t mention it in
the SELECT statement; to double-load a column,
mention it twice. For an example, see below (“Load
Script Language”).

In addition, we’ve extended our SQL to support
functions written in Perl, which you can submit with
any SQL statement, either at load- or query-time. In
this way, you can write custom parsers and use
third-party libraries (e.g., perl5 modules) to parse the data.
Using the new Inline Perl module (http://inline.perl.org/),
you can even dynamically load code written in
other languages, including C, C++, Java, Ruby, Python,
etc. In practice, our users have used the Perl interface in
ways we never expected. As an example, one user
implemented functions to parse User-Agent tags and
look for worms. In this way, he could exclude traffic
that wasn’t related to real users, including worms like
NIMDA and robot-agents like the Google crawler.


The regular expression match, the SQL statement
and any embedded Perl code all run in parallel
across the cluster. In practice, we’ve seen near-linear
scaleups because parsing is CPU-intensive once
you include all of the “business rules” of real world
parsing.

The Addamark parse-transform-and-load
“language” (PTL) uses a perl5 regular expression to
perform the basic parse, while reusing the SQL and Perl
engines to perform the transformation. Display 1 shows
an example PTL script for loading an NCSA weblog.


Again, it is important to note that the entire PTL 
script is executed in parallel across the cluster. Thus, 
even if you embed complex Perl functions or a multi- 
tude of complex regular expressions, you’ll still be 
able to parse tens of thousands of records per second. 
For example, one customer has a PTL script which 
calls a home-brewed parse_useragent function on 
every record as it comes in, rather than doing this 
analysis on every query — although this improves 
query performance, the real value is in having the 
table pre-populated with the various browser attributes 
up-front, which makes query-writing easier. 
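
As a rough illustration of the parse-then-transform flow that a PTL script describes, the following Python sketch parses one of the example records from Display 1 with a regular expression and then converts individual fields. The regular expression, the helper names (parse_record, transform) and the use of Python in place of the LMS's own SQL and Perl engines are all assumptions made for illustration; the reverse DNS lookup is left out.

```python
import re
from datetime import datetime

# Parse-field names, taken from Display 1.
FIELDS = (
    "ClientIP", "unused1", "unused2", "tsStr", "Method", "Url",
    "HttpVers", "RespCode", "RespSize", "Referrer", "UserAgent", "RespTime",
)

# Illustrative regexp for a single-line NCSA-style record. This is an
# assumption for the sketch, not the regexp from the original script.
RECORD_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] '
    r'"(\S+) (\S+) (\S+)" (\d+) (\d+) '
    r'"([^"]*)" "([^"]*)" (\d+)$'
)


def parse_record(line):
    """Basic parse: split one record into named string fields."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


def transform(parsed):
    """Transform step: convert types and drop the unused fields."""
    return {
        "ts": datetime.strptime(parsed["tsStr"], "%d/%b/%Y:%H:%M:%S %z"),
        "ClientIP": parsed["ClientIP"],
        "Method": parsed["Method"],
        "Url": parsed["Url"],
        "HttpVers": parsed["HttpVers"],
        "RespCode": int(parsed["RespCode"]),
        "RespSize": int(parsed["RespSize"]),
        "Referrer": parsed["Referrer"],
        "UserAgent": parsed["UserAgent"],
        "RespTime": int(parsed["RespTime"]),
    }


line = (
    '62.243.230.170 - - [19/Feb/2002:17:29:43 -0800] '
    '"POST /cgi-bin/form.pl HTTP/1.1" 302 5 '
    '"http://addamark.com/product/requestform.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" 1'
)
row = transform(parse_record(line))
print(row["ts"].isoformat(), row["Method"], row["RespCode"])
```

Omitting a key from the dictionary returned by transform corresponds to not mentioning a column in the SELECT statement.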


Putting it together, Figure 4 shows how the LMS
loads data; each box represents a thread of control and
each set of vertically-aligned boxes represents one host.
In this example, the cluster is of size three. Typically,
a loading client sends the log data to one of the hosts
in the cluster, which we call the “master.” Any host can
play “master” for any load request; the job of the master
is to break up the datastream into records, and to farm
those records out to machines in the cluster.
Figure 4: How LMS loads data.


Parsing and storage then happen in parallel, and
finally, the records are merged into the existing sets of
records on the given storage medium. As suggested by
the diagram, the Addamark LMS can store its log data
on local disk, network attached storage (NAS,
e.g., NFS), or a storage area network (SAN, e.g.,
Fibre Channel).
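
The master's job of breaking a datastream into records and farming them out can be sketched as follows. This is a minimal Python illustration, assuming newline-delimited records and a simple round-robin assignment; the paper does not state the actual distribution policy, and the host names are hypothetical.

```python
from itertools import cycle


def farm_out(datastream, hosts):
    """Split a newline-delimited datastream into records and assign
    them to hosts in round-robin order (an assumed policy)."""
    assignments = {host: [] for host in hosts}
    targets = cycle(hosts)
    for record in datastream.splitlines():
        if record:  # skip blank lines
            assignments[next(targets)].append(record)
    return assignments


stream = "rec1\nrec2\nrec3\nrec4\nrec5\n"
plan = farm_out(stream, ["host-a", "host-b", "host-c"])
for host, records in plan.items():
    print(host, records)
```

Each host would then parse and store its own share of the records independently, which is what lets the expensive parsing step scale with the size of the cluster.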


It’s not shown in the diagram, but since the
client-server and server-server protocols are identical,
it’s straightforward to have client tools load directly into
LMS datastore nodes, bypassing the need for a master
(and the scalability bottleneck it creates), at the cost of
greater configuration complexity.


Storage and Data Management 


Requirements. An LMS should automatically
handle all indexing, compression, storage, layout and
so on, ideally in such a way that queries are then fast
to run.

• Indexing, compression, storage and data
management. Ideally, compression should be
as good as GZIP, since managed storage is
expensive. Compression ratios should be
reasonably stable, as should indexing quality:
basic queries should return in the same amount
of time regardless of the log data being loaded.

• Data administration tools for giant tables.
Although the requirements are rather specialized,
an LMS shares many requirements with
databases: you need to control concurrent
access, support incremental backup, restore and
replication, deal with corrupted datastores,
retire data and manage the evolution of the data
definition (e.g., add/rename/remove columns).

• Timestamp support. Every log record has a
timestamp. Unfortunately, it’s rarely the case
that log data is all in one format (i.e., need
functions for parsing them), in GMT (i.e., need
flexible timezone support) or synchronized
across multiple sources (i.e., need clock sync
routines). Some log data contains sub-second
resolution, so it’s important to handle this.

• System admin. If you can make the LMS “not
a server,” it’s a big win: system administration
gets easier because nothing has to stay up 24x7.
Unfortunately, this is probably not realistic
because it leaves too much of the work for users.
The next best design is to make the LMS into a
set of CGI scripts, hostable inside a webserver,
thereby reusing the 24x7 and monitoring support
found in the webserver. Example issues
include the ability to list/kill/pause/slow LMS
operations such as loads or queries, and to
install/uninstall/configure/reconfigure the LMS.
It is a terrible idea to use threads or custom
servers for the LMS, because you then need
new tools to manage it and you’ll have separate
security issues, etc.

• Security. Access control, authentication and
security are paramount issues in any data
management system.


Design Decisions 


• Indexing, compression, storage and data
management. We chose gzip and bzip to perform
compression for us, first parsing the data
to get the best possible compression ratio. We
also employ several tricks that leverage our
knowledge of particular types of logs, for
example encoding timestamps as delta-offsets
from one another, rather than as distinct values.
The net result is compression that almost
always beats gzip by a wide margin, sometimes
by as much as 2x.

We chose to store the log data as sets of plain
files in the filesystem, one per column. Specifically,
the files are laid out as a hierarchical set
of directories, broken out by time. For concurrency
control, we use lockfiles stored on local
disk (e.g., /var/...) because locking over NFS
can be flaky, and we figured that some users
may want to store their log data on network
disks.

We’re careful about touching files, allowing
administrators to perform incremental replication,
backup and restore using tools like rsync(1) and
find(1). For more information, see “Data
Administration” below.

• Clustering and fault tolerance. Because the use
of multiple computers inherently increases the
chances that one system will fail, we included
automatic failover. It works by mirroring every
record across two hosts in the cluster: each host
has a “sibling” for the data it stores.

When a computer fails, the hosts that are trying
to contact it automatically fail over to the sibling,
e.g., for running queries. This causes a
50% performance degradation during failures,
but 100% performance in their absence
(unlike RAID).

Since perfectly even distribution is not
required, when a host fails, loads simply route
around both the host and its sibling. Since our
clusters typically start at five hosts, this leaves
plenty of horsepower even during failures.
• Timestamp support. We support TIMESTAMP
as a native datatype, represented as a
64-bit integer value of microseconds since the
epoch (Jan 1, 1970). The SQL engine has the C
library functions strptime() and strftime() built
in for parsing and printing TIMESTAMPs,
respectively.

Timezone support is offered in all timestamp-related
functions, and the SQL engine supports
changing its default timezone using the clause
“WITH TIMEZONE ...” To synchronize
clocks, load requests from the log-generating
devices can include their local clocktime, and
the engine will automatically compute the
difference and apply it to the log data.

Internally, we store all data in GMT, which is
simpler, but which requires users to set the
timezone when printing timestamps using “WITH
TIMEZONE” or in each formatting call.

• System admin. We chose to represent LMS
operations as sets of Linux processes, allowing
users to use Linux tools (e.g., ps(1)) on them.
These processes are launched from an off-the-shelf
webserver (currently thttpd). In addition
to being able to control jobs with per-machine
utilities, such as nice(1), we also offer XML-RPC
calls to control sets-of-processes across
the cluster. We included scripts to perform
cluster-wide install/uninstall/reconfigure.
• Concurrency control. The Addamark LMS
provides a timerange-based concurrency control
scheme that enables concurrent updates and
queries. For example, retiring data does not
block queries or new data loads. Also, two data
loads will interleave in such a way as to block
neither one, a critical requirement because
loads can take a long time, causing timeouts in
upstream processes. Unlike generalized
database transactions, loads do not perform
reads in between updates, so this interleaving
doesn’t cause integrity loss.

• Low-level tools. Not that this would ever happen,
but in practice files can become corrupted
through software bugs, disk drive problems, etc.
One customer told us a horror story about losing
a person-week trying to recover data from a
corrupted Microsoft SQL Server database.

To reduce this pain, the LMS includes low-level
tools to manage the data (e.g., read the
files) that do not depend on the LMS being
up, and which are resilient to corruption.
Likewise, the Addamark LMS includes low-level
tools for managing files that are replicated
across a cluster, with cluster-wide diff (“cldiff”),
synchronization (“clsync”), run-a-command
(“clssh”), and so on.

These tools read the same cluster.xml config
file, so membership changes affect both the
LMS server and the lower-level utilities.
However, for obvious reasons, the utilities can
override the membership list.

• Retirement and data evolution. When the
definition of a log changes, e.g., new columns, you
need to be able to change the definition in the LMS.
So-called “schema evolution” can be handled
with standard SQL statements such as ALTER
TABLE ADD/DROP/RENAME COLUMN.
Retiring data works using DELETE FROM,
which allows you to define WHERE and DURING
clauses to control what data gets deleted.

By the time you read this, we’ll have implemented
a policy manager which provides a nice
front-end to handle the most common cases.
Also, since the goal of retirement is to save disk
space, and since summaries are a tiny fraction of
the size of the original data, we’re adding facilities
(e.g., INSERT INTO SELECT FROM) to
retire-to-a-summary and utilities to implement
the most common policies.

In a clustered system, retiring to offline media can
either be done per-system, or unified. The former
takes advantage of the file-based storage, while
the latter reuses the query mechanism, which
already unifies data from across the cluster.
• Security. At one level, LMS security is simple:
we support SSL access to the cluster. You can
also use IP blocking, firewalls and/or VPNs to
restrict access to selected clients. In reality, LMS
authentication and access control is a complex
topic, easily filling a paper all by itself.
Simultaneously, this is an area of active development for
us, so any information would be obsolete. Look
for future reports on the subject.
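
Two of the storage decisions above, timestamps held as 64-bit microsecond counts and delta-encoding them before general-purpose compression, can be illustrated together. The following Python sketch is an assumption-laden illustration rather than the LMS's actual on-disk format: it uses zlib (the DEFLATE algorithm that gzip also uses), packs values as little-endian 64-bit integers, and generates synthetic timestamps with small, repeating gaps.

```python
import struct
import zlib
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Represent a timestamp as integer microseconds since the epoch."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def delta_encode(values):
    """Store each value as its offset from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the offsets."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def pack(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack("<%dq" % len(values), *values)


# Synthetic column of 1000 timestamps, a few seconds apart.
gaps_in_seconds = [1, 2, 1, 3, 1, 1, 2, 5]
current = to_micros(datetime(2002, 1, 29, 23, 44, 37, tzinfo=timezone.utc))
timestamps = []
for i in range(1000):
    current += gaps_in_seconds[i % len(gaps_in_seconds)] * 1_000_000
    timestamps.append(current)

raw_size = len(zlib.compress(pack(timestamps)))
delta_size = len(zlib.compress(pack(delta_encode(timestamps))))
print("compressed raw:", raw_size, "compressed deltas:", delta_size)
```

Because consecutive log records are usually close together in time, the deltas are small and repetitive, which is the kind of input a general-purpose compressor handles well.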


Querying and Reporting 


Requirements. From a high-enough level,
querying an LMS is a lot like querying a database. In
practice, the workload looks quite different, and the LMS
should be optimized accordingly. First, for some
applications, it is important to be able to quickly retrieve
the original records (whitespace and all), for example
for legal use. More commonly, users want to get
summaries and histograms, such as traffic per unit-time.
More sophisticated queries include lookups (e.g.,
resolve ID fields against live data sources at query time,
rather than during loading) and sequencing/sessionizing
queries (e.g., recreate web user sessions, or match activity
to a given router as an “attack”). In practice, real world
use quickly demands custom filters (SQL WHERE
clauses), custom counters (SQL aggregates, such as a
new type of “SUM7”) and data sources outside the
LMS (virtual/computed tables).


Reporting is the ‘higher level’ functionality
around querying, including metadata queries (“what
data is available?”), query caching,
presentation/formatting (e.g., Microsoft Excel, HTML, XML, etc.)
and connectivity (e.g., ODBC, JDBC, DBI/DBD,
etc.).

Parallel queries use a similar scheme to loading,
but in reverse. Specifically, the SQL DURING and
WHERE clauses get executed as part of the filtering
service, then the results are routed across the cluster to
the “compute” services such that every group (i.e.,
GROUP BY expression) lands on the same host. To
support parallel GROUP BY and SLICE BY, the
groups are distributed randomly across the cluster.
HAVING, which filters groups, is also implemented at
the compute layer. Finally, ORDER BY and TOP-n
are implemented at the compute layer, and merged
together at the master to form the final result. The
above description is the general case for simple
aggregation queries; fancier cases like JOINs, subqueries,
UNIONs, etc. are possible as well, but beyond the
scope of this paper, as are the numerous optimizations
that we’ve implemented.
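
The routing scheme described above can be sketched in a few lines of Python. This is an illustration only: it assumes a COUNT(*) aggregate, uses a CRC32 hash of the group key to pick the compute host (the paper says only that groups are spread randomly and that each group lands on a single host), and omits the WHERE and DURING filtering. The function names are hypothetical.

```python
import zlib
from collections import Counter


def route(group_key, num_hosts):
    """Pick the compute host for a group. The same key always maps
    to the same host, so no group is split across hosts."""
    return zlib.crc32(group_key.encode("utf-8")) % num_hosts


def parallel_top_n(records_per_filter_host, num_hosts, top_n):
    # Filter layer: every host scans its own records and routes
    # each one to the compute host that owns its group.
    compute_inputs = [[] for _ in range(num_hosts)]
    for records in records_per_filter_host:
        for key in records:
            compute_inputs[route(key, num_hosts)].append(key)

    # Compute layer: each host aggregates its groups and keeps only
    # its local top-n, which is safe because groups never span hosts.
    partials = [Counter(keys).most_common(top_n) for keys in compute_inputs]

    # Master: merge the local results into the final answer.
    merged = [pair for partial in partials for pair in partial]
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged[:top_n]


urls_by_host = [["/", "/a", "/"], ["/b", "/", "/a"], ["/a"]]
result = parallel_top_n(urls_by_host, num_hosts=3, top_n=2)
print(result)
```

Keeping each group on one host is what allows ORDER BY and TOP-n to run at the compute layer: the master only has to merge short, already-aggregated lists.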


Design Decisions. For querying, we chose to
offer a simplified flavor of SQL, make sure it runs in
parallel, then use the Perl extension mechanism to
handle the custom needs of log applications. To handle
sequences/sessions, we extended GROUP BY with
SLICE BY, which “slices” a group into multiple
groups based on a user-defined predicate. To handle
sessions, this predicate can be stateful, e.g., 10 minutes
since we’ve seen activity for a given user. The
design of SLICE BY allows the LMS to “sessionize”
traffic after it’s been loaded, allowing you to change
the business definition of a “session” after the fact,
and it allows the LMS to sessionize traffic in parallel,
a critical requirement (see Figure 5).
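
The effect of slicing a group with a stateful predicate can be sketched in Python as follows. This is not SLICE BY itself, whose syntax is not shown in this paper; it is a minimal illustration that assumes events are (user, timestamp-in-seconds) pairs and uses the 10-minute inactivity rule mentioned above.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 10 * 60  # "10 minutes since we've seen activity"


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group events by user, then slice each group into sessions
    whenever the gap since the previous event exceeds the threshold."""
    by_user = defaultdict(list)
    for user, ts in events:  # the GROUP BY step
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        slices = []
        last_seen = None  # the state carried by the predicate
        for ts in stamps:
            if last_seen is None or ts - last_seen > gap:
                slices.append([])  # predicate fired: start a new slice
            slices[-1].append(ts)
            last_seen = ts
        sessions[user] = slices
    return sessions


events = [("alice", 0), ("alice", 300), ("alice", 1000), ("bob", 50), ("alice", 1500)]
sessions = sessionize(events)
print(sessions)
```

Because the threshold is an argument applied at query time, changing the definition of a session only means rerunning the query with a different value; and because each user's events are handled independently, the work can be split across hosts by user.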

We offer “system” tables which contain lists of
tables, columns, etc. For caching, formatting and
connectivity, we provide a set of client-side tools and
connectors. In addition, we opted to use XMLRPC-over-HTTP
as our network protocol. This means that you
can submit queries to the LMS using curl, lynx or
even a homebrew perl script, without some fancy
code library. In practice, partner companies have gotten
new clients to work in under an hour, using nothing
but examples.
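
To give a sense of how little client code an XML-RPC-over-HTTP protocol requires, the sketch below builds a request body with Python's standard xmlrpc.client module. The method name "query" and the single-string parameter are hypothetical; the actual LMS method names and parameters are not given in this paper.

```python
import xmlrpc.client

# Hypothetical query text and method name, for illustration only.
sql = "SELECT TOP 100 hostname, message FROM syslog DURING ALL"
payload = xmlrpc.client.dumps((sql,), methodname="query")

# The payload is plain XML text that any HTTP client could POST.
print(payload)

# Decoding it again recovers the method name and parameters.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

The same request body could equally be written by hand and sent with curl, which is the point being made about not needing a special client library.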


We chose an extended flavor of SQL as the basis
for querying the LMS. Display 2 shows the SQL to
return the first 100 log records after midnight, Feb 1.

Figure 5: Database querying architecture. Each computer
runs all 3 services.


The WITH clause sends various parameters to 
the SQL engine; these can be overridden on the com- 
mand-line. In this case, we’re telling the engine to pro- 
duce results in California time, rather than its internal 
time (GMT). 
The SELECT clause tells the engine what
columns should appear in the results, and from which 
table to get them. In this case, we want all of the fields 
that appear in a (parsed) syslog. 


The DURING clause is an Addamark extension 
which tells the engine which timerange you’re inter- 
ested in querying, so you don’t accidentally query the 
whole table. In the rare case when you want to query 
everything, you can specify “DURING ALL”.


To execute this query, you’d run something like: 
atquery lms.myco.com:8072 myquery.sql 


atquery(1) is our command-line utility for sending your 
SQL statement to the server, capturing the response 
(data, errors and/or progress indicators) and pretty- 
printing it to the screen, file, etc. In this example, 
“lms.myco.com” would be one of the systems in an
LMS cluster. You can even map the LMS hosts into a 
“virtual IP” behind a load-balancer, which then pro- 
vides additional fault tolerance. 


Here’s a more interesting example, retrieving a 
histogram of website traffic by day for the previous 
month. 


The “WITH $foo” clauses define expression- 
macros, which work like C preprocessor macros. 
We’ve found macros to be lifesavers in practice, espe- 
cially for clauses like DURING. Even better, the client 
tools support “include” which includes other files’ 
worth of macros. This way, you can put the WITH 
TIMEZONE in a central file, then have every query 
affected by it. Finally, the tools also support overriding 
the WITH definitions from the command-line, allow- 
ing you to specify the $start and $end from the com- 
mand-line, even though they were also given defaults 
in the query file. Unlike PL/SQL and other “stored 
procedure” languages, Addamark SQL uses Perl, i.e., 
an industry-standard language (Java and C++ coming 
soon), and the perl code automatically runs in parallel 
across the cluster of PCs. In practice, the CPUs on 
modern PCs can execute Perl code amazingly fast, 
resulting in terrific performance, even for complex 
algorithms containing numerous regular expressions. 
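
The macro behavior described above (defaults in the query file, overridable from the command line, expanded textually like C preprocessor macros) can be imitated with Python's string.Template, which happens to use the same $name syntax. This is a sketch of the idea only, not the client tools' implementation; the expand_macros helper and the query text are illustrative assumptions.

```python
from string import Template


def expand_macros(query, defaults, overrides=None, max_passes=10):
    """Textually expand $name macros. Overrides (e.g., from the
    command line) take precedence over defaults from the query file.
    Expansion repeats so that macros may refer to other macros."""
    macros = dict(defaults)
    macros.update(overrides or {})
    text = query
    for _ in range(max_passes):
        expanded = Template(text).safe_substitute(macros)
        if expanded == text:
            break
        text = expanded
    return text


defaults = {
    "end": "_now()",
    "start": '_timeadd($end, -1, "month")',
}
query = "SELECT COUNT(*) FROM example_websrv DURING $start, $end"

default_sql = expand_macros(query, defaults)
override_sql = expand_macros(
    query, defaults, {"start": "time('Feb 01 00:00:00 2002')"}
)
print(default_sql)
print(override_sql)
```

Putting shared definitions such as the timezone in one included file then amounts to merging one more dictionary of defaults before expansion.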


-- dash-dash starts a SQL comment, much like hash (#) in the shell
WITH TIMEZONE 'US/Pacific'
SELECT TOP 100 _timef("%c", ts), hostname,
       progname, processID, message
FROM syslog
DURING time('Aug 08 04:00:00 2001'), time('max')

Display 2: SQL to return first 100 log records after midnight, Feb 1.





WITH TIMEZONE 'US/Pacific'
WITH $end AS _now()
WITH $start AS _timeadd($end, -1, "month")
SELECT _timef("%m/%d/%Y", ts) as 'date'
     , COUNT(*) as "hits"
FROM example_websrv
GROUP BY 1   -- rollup the hits by date (result column #1)
ORDER BY 1   -- then sort the results by date (result column #1)
DURING $start, $end

Display 3: Retrieve a histogram of daily web traffic for previous month.
_now(), _timeadd() and _timef() are builtin
Addamark functions. We chose to prefix our builtins 
with underscores to reserve the namespace for other 
uses. _now() returns the timestamp when the query 
was submitted to the LMS; _timeadd() is a builtin 
function which knows how to add timestamps cor- 
rectly, including accounting for the timezone. _timef() 
is a function for formatting timestamps as ASCII 
strings, a direct mapping of the C function strftime(3). 


If you also want to return the aggregate band- 
width per day, simply add a third result target — 
SUM(respsize)/1024.0/1024.0 AS “MB sent.” 
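
For readers more familiar with procedural code, the daily histogram with the extra bandwidth column computes roughly the following. This Python sketch assumes records are (timestamp, response size in bytes) pairs that are already in the desired timezone; the function name is illustrative.

```python
from collections import defaultdict
from datetime import datetime


def daily_histogram(records):
    """Return (date, hits, MB sent) per day, sorted by the date string."""
    hits = defaultdict(int)
    bytes_sent = defaultdict(int)
    for ts, respsize in records:
        day = ts.strftime("%m/%d/%Y")  # same format string as _timef()
        hits[day] += 1                 # COUNT(*)
        bytes_sent[day] += respsize    # SUM(respsize)
    return [
        (day, hits[day], bytes_sent[day] / 1024.0 / 1024.0)
        for day in sorted(hits)        # ORDER BY 1
    ]


records = [
    (datetime(2002, 1, 29, 23, 44, 37), 524288),
    (datetime(2002, 1, 29, 23, 45, 6), 524288),
    (datetime(2002, 2, 19, 17, 29, 43), 2097152),
]
histogram = daily_histogram(records)
print(histogram)
```

As in the SQL, the sort is on the formatted date string, so results are ordered by month and day rather than by the underlying timestamp.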


Lastly, to demonstrate the power of embedded
Perl, Display 4 shows how to compute the top 25 most
popular “pages” in the website. Only, let’s normalize
the webpages, so that URLs like / and /index.htm
don’t show up separately.


WITH TIMEZONE 'US/Pacific'
WITH $end AS _now()
WITH $start AS _timeadd($end, -1, "month")
-- this defines a new perl function, which can be called from SQL
WITH normalize_url AS 'perl5' FUNCTION <<EOF
sub normalize_url {
    my($url) = @_;
    ## in this site, index.html pages are the same as trailing-slash pages
    $url =~ s@/index.s?html?$@/@;
    ## other rules go here...
    #
    # uncomment this to send debug messages back to the client tool
    # i.e., they’re collected from each of the nodes in the cluster,
    # unified and streamed back over the client-connection as
    # out-of-band messages.
    #
    ## addamark::dbgPrint("hello, world");
    return $url;
}
EOF
SELECT TOP 25 -- returns the first 25 records, assuming there’s an
              -- ORDER BY to sort them.
       _perl("normalize_url", url) as url
     , COUNT(*) as 'hits'
     , SUM(respsize)/1024.0/1024.0 as 'MB sent'
FROM example_websrv
WHERE respcode < 300 -- ignore HTTP redirects and errors
GROUP BY 1
ORDER BY 2 DESC      -- this time, sort by the most-popular-first
DURING $start, $end

Display 4: Compute top 25 most popular “pages” in the website.


Lessons Learned


Here are some of the things we learned
implementing the LMS:

• PC clustering changes everything. Modern
networks are very fast and very cheap, and the
CPUs on modern PCs are also very fast, so
much so that it almost always makes sense to trade
intra-cluster bandwidth and CPU performance
for other resources, such as RAM capacity or
disk I/O. For $70,000 in hardware, you can put
together a system with over 100 GHz of CPU,
with more switch bandwidth than the CPUs can
saturate, and with 72 terabytes in storage,
including a mirror copy.

• Timestamps are a pain. It is easy to underestimate
the hassles in dealing with timestamps. As a
simple example, the default routines for parsing
timezones didn’t recognize “PDT” (Pacific
Daylight Time) even though it’s produced by the
date(1) utility. If you don’t solve issues like
this, users get annoyed with not being able to
cut and paste. Another example is the lack of a
“%Z” in strptime(), which you need in order to
capture the timezone from logs which contain
per-record timezones. At the user-level, you’ll find
logs that are missing critical time fields, such as
syslogs that don’t include the year and weblogs
lacking the timezone.

• Buffer the logs. Originally, we thought that
users would want a library for collecting logs,
but between syslog, weblogs, etc. users have
plenty of logs already; they just need a
management system for them! Typically, they “roll”
the logs every T units of time, creating compressed
files on the log-generating device. So-called
“real-time analytics” turned out to be a red herring:
99% of applications can live with 5-minute
response times because people are involved in
the chain, and they can’t react faster than this.
Five minutes is plenty to roll a log, compress it
and send it to a centralized LMS. Wide-area
collection remains a challenge, especially across
firewalls, but this problem doesn’t seem to
have a silver bullet.

• Tag the data. The combination of data
compression and columnar storage makes “tags”
essentially free in most cases (i.e., few unique
values). Tagging can be used to provide all
sorts of services, and solve all sorts of problems
(for example, see “Guarantees” below).

• Guarantees. In theory, end-to-end atomicity
(“once and only once”) across an enterprise
requires “two phase commit.” In practice,
vendor heterogeneity and the complexity of
automated recovery make this impractical. Instead,
we rely on store-and-forward (aka buffering) to
ensure against data loss. The downside is that
duplication becomes possible. Fortunately, the
same data-tagging system we use to allow users
to track data back to its source also allows the
LMS to detect and undo duplicate loads.

• Packaging matters. Packaging turned out to be
surprisingly important. Our decision to “make
the LMS look like apache” was a big win,
because it was instantly familiar to users and
because the config files were easy to explain.
Likewise, replicating all configuration across the
cluster made sense to people, while making the
LMS resilient to individual machine failures. The
biggest win of looking like a webserver, though,
was the choice of HTTP as the network protocol,
including a complete embedded webserver and a
copy of the docs. This meant that our network
protocol can be proxied, encrypted, tunneled, etc.,
all without special support.
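
The way tagging helps with duplicate loads under store-and-forward can be sketched as follows. This Python toy is an assumption for illustration, not the LMS mechanism: it supposes every load carries a tag identifying its source batch, and the class and method names are hypothetical.

```python
class TaggedStore:
    """Toy datastore in which every row remembers the load it came from."""

    def __init__(self):
        self.rows = []            # each row is (load_tag, record)
        self.loaded_tags = set()

    def load(self, load_tag, records):
        """Load a batch; refuse it if a batch with this tag was already
        loaded (e.g., a retry after a lost acknowledgement)."""
        if load_tag in self.loaded_tags:
            return False
        self.loaded_tags.add(load_tag)
        self.rows.extend((load_tag, record) for record in records)
        return True

    def undo(self, load_tag):
        """Remove every row that came from the given load."""
        self.rows = [row for row in self.rows if row[0] != load_tag]
        self.loaded_tags.discard(load_tag)


store = TaggedStore()
first = store.load("web01/access.log.2002-02-19", ["r1", "r2"])
retry = store.load("web01/access.log.2002-02-19", ["r1", "r2"])
print(first, retry, len(store.rows))
```

Since the tag column has very few distinct values, the compression and columnar storage described earlier keep its storage cost small.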


The Future 


The existence of a scalable LMS has changed
things, but much work remains. First, the combination
of fast loading, aggressive data compression and PC
disks has all but made log storage “free.” Early users
would worry about running out of disk, until they did
the math and realized that even small clusters of PCs
could store years of data. Five PCs alone could store a
month of traffic logs from all of Yahoo! This brings us
to the second lesson: although you can store years of
data online, and access to any (short) timerange is
quick, if you want to analyze the whole thing, it’s
going to be slow. Therefore, you want to scale the
LMS (more CPUs, RAM, etc.) according to the
“working set size” rather than disk capacity. Thus, a
balanced system would have a tiny disk drive. But
extra disk capacity is cheap, so in practice users buy
far more than they need. In other words, storage
capacity just became free.
Demands from users have suggested our future
directions. First, as users build up larger and larger 
recordsets, they are asking us to provide more and 
more facilities for managing and reorganizing this data 
over time. For example, as you grow a cluster, you'll 
want to buy the latest, fastest hardware, rather than the 
same model as when you started. Thus, we’ve recently 
added a way for the LMS to automatically detect per- 
formance differences between machines in the cluster, 
and load balance between them. Only, unlike with 
webservers, load balancing parallel SQL requests is 
quite complex, and is beyond the scope of this paper. 


Second, users have started “faking out” the
LMS by replicating the files by hand among multiple 
LMS clusters. While this works to some extent, we 
can imagine many features that would facilitate dis- 
tributed, poly-clusters, with (partially) replicated data. 
Again, this is beyond the scope of this paper. 
who helped pay the bills and keep our spirits up
through the dark days of 2001. 


Software Availability 


The Addamark LMS is a commercial software 
package available today, with introductory pricing 
starting around $75,000 for a complete package. 
Addamark also offers professional services and sup- 
port. For more information, please see our website at 
http://www.addamark.com/. 
ABSTRACT


System administration has become an increasingly important function, with the fundamental 
task being the inspection of computer log-files. It is not, however, easy to perform such tasks for two 
reasons. One is the high recognition load of log contents due to the massive amount of textual data. It 
is a tedious, time-consuming and often error-prone task to read through them. The other problem is 
the difficulty in extracting unusual messages from the log. If an administrator does not have the 
knowledge or experience, he or she cannot readily recognize unusual log messages. To help address 
these issues, we have developed a highly interactive visual log browser called “MieLog.” MieLog
uses two techniques for manual log inspection tasks: information visualization and statistical analysis. 
Information visualization is helpful in reducing the recognition load because it provides an alternative 
method of interpreting textual information without reading. Statistical analysis enables the extraction 
of unusual log messages without domain specific knowledge. We will give three examples that 
illustrate the ability of the MieLog system to isolate unusual messages more easily than before. 


Introduction 


Administration of computers has become more
important than ever because of the increasing role of
computers and networks in providing various services
to our daily life. It is therefore necessary to administer
them continuously and properly as part of our modern
infrastructure.


Computer log inspections are the most funda- 
mental tasks in administration, since most of the 
events occurring in computers and networks are 
recorded into log-files. Administrators, therefore, must 
inspect them periodically. When they find an anomaly 
in the log-file, they must make an appropriate 
response as soon as possible. Today’s security threats 
to a computer network increase the importance of log 
inspections to help detect possible breaches. 


Although administrators recognize the impor- 
tance of log inspection, the task is often not performed 
regularly at many computer sites. One reason is that it 
is a tedious and time-consuming task due to the large 
amount of textual data. Another reason is that it 
requires skilled knowledge to recognize an unusual 
message in the log-files. 


We have developed a highly interactive log
browser, called “MieLog,” which uses information
visualization and statistical analysis to help alleviate
some of the problems involved in log monitoring. The
purpose of the system is to assist administrators in
inspecting computer logs manually. MieLog combines
three main approaches. One is information visualization
to reduce the recognition load of textual data. Another
is a high level of interactivity which makes it easier to
filter out or extract information from log data. The last
is statistical analysis which provides inspectors with
various tools to help detect unusual messages in the log.


This paper is organized as follows: First, we 
mention the issues and importance of computer log 
inspections. The next section presents the system 
overview and the detail of each module of MieLog. 
Subsequently, we explain the visualization method of 
computer logs and interactive functions. Then we show
some examples of computer log inspections using 
MieLog. Finally, we discuss related work and pro- 
posed future enhancements. 


Problems of Computer Log Inspections 


There is no doubt that log inspections are indis- 
pensable for computer administration. Administrators, 
however, regard them as tedious, time-consuming and 
often unrewarding tasks. Therefore, although some 
administrators are aware of the importance of the 
tasks, they hesitate to perform them. Indeed, a recent 
security survey in Japan shows that such tasks have 
not been performed sufficiently even by Internet ser- 
vice providers. 


We can define log inspections in more specific 
terms: 
1. Administrators retrieve a log and analyze its
contents by reading through the messages.
2. Administrators extract unusual messages from 
the log. 


We will consider each of these problems in further
detail. The factors that make it difficult to interpret
the log contents are that:

• Log messages are recorded as text.
• Logs usually contain a huge amount of data.
• Logs have various kinds of formats and content.

These factors clarify the problems that adminis- 
trators must face. Administrators must read through 
the log messages to understand them and they must 
spend many hours doing so. Thus, it is almost impos- 
sible to manually inspect computer logs at a large 
computer site. Administrators also require specialized 
knowledge about the logs, since recording formats, 
contents and existing directories of each log may be 
completely different. There are many problems with 
just the log recognition stage in log inspection tasks. 


Next, we list the factors that make it difficult to 
isolate unusual log messages. 

• Log messages resulting from a problem or
intrusion may be few and buried among other
unimportant messages.

• It is difficult to build rules for automatic extraction
of all unusual messages.

• There are cases when the administrator cannot
determine whether the log message results from
an abnormal event or not.


Computer log-files contain various kinds of mes- 
sages. They contain not only errors and warnings but 
also operating system status and notices from applications.
In general, they contain only a few important
messages, while the others are less important mes- 
sages. Administrators, therefore, must be able to 
extract the important messages from the log. 


The next problem is that it is difficult to formulate
the rules for unusual log message extraction.
There are two reasons. One is that no administrator
knows what constitutes unusual messages for all
cases. The other is that the rules are highly dependent
on both the administrator’s knowledge and the environment
of the site. We consider it important to build
rules for extracting known problem log messages. We
believe that it is also important to inspect the logs
periodically to find unknown unusual log messages
and rebuild or refine extraction rules based on the new
information.


Figure 1: A display image of MieLog.


The last problem is that administrators cannot 
judge the importance of each log message based solely 
on one kind of log-file. Because each log message 
contains only partial information about an event that 
has occurred in a computer, a more reliable judgment 
requires log messages from multiple log-files. It is 
therefore necessary to collect other related information 
from various log-files and analyze them comprehen- 
sively. This would require many operations, as well as 
extensive knowledge and time. 


In this paper, we propose the use of information 
visualization and statistical analysis to address the 
above problems. 


In general, the number of unusual log messages is 
small in typical log-files. If we obtain frequency infor- 
mation from a log using statistical analysis, it is possible 
to isolate such log messages. It helps administrators to 
find truly unusual log messages. Furthermore, MieLog
visualizes frequency information and the log file itself
as a figure. It reduces the recognition load of log messages
when inspecting them. The reason is that the
method of log message recognition changes from “reading”
to “looking.”

MieLog does not just visualize a log as a figure. 
It also adds a high level of interactivity. Many of the 
interactive functions help perform filtering of log mes- 
sages in various ways. An inspector can execute com- 
mands by direct interaction with a visualized figure. 
The combination of the two features makes it possible 
to reduce the problems of inspecting logs by humans. 


MieLog: System Overview and its Visualization 


We developed an interactive log information 
browser called “MieLog” based on the considerations 
presented in the previous section. In this section, we 
describe the system modules of MieLog and the features
which address the above-mentioned problems.


We used the C++ programming language and the
OpenGL library to develop MieLog. The visual
screen of MieLog is shown in Figure 1. The screen is
composed of four visualization areas. They are the “Tag
area,” the “Time area,” the “Outline area” and the
“Message area,” in order from left to right
on the screen. The left three areas visualize different
“characteristics” of the log, while the fourth area is for
viewing the actual message text. MieLog also visualizes
the relationship between each area clearly.


Figure 2: A process module overview of MieLog.


Figure 3: Specification of general log format and conversion example.
[Figure 3 note: GLF has the same format as the SYSLOG format
if tag1 is the hostname and tag2 is the program name.
Example message text: connect from someone.else.net]


The system modules of MieLog are illustrated in 
Figure 2. MieLog is composed of three modules. We 
explain the details of each module in the following 
sections. 


Log Conversion Module 


This module converts logs with various record- 
ing formats into an intermediate format called the 
“Generalized Log Format” (GLF). 


There are various problems involved in inspect- 
ing a log. One is the different recording formats used 
and the types of contents of each log message. The 
other is that administrators must have extensive 
knowledge about the log: where the log-file exists, 
how to get the log messages and which log-files 
should be inspected. To simplify these problems, we 
provide two types of tools. One is the log collection 
and conversion tool. The other tool is for merging 
converted logs. Figure 3 shows the syntax of the 
“Generalized Log Format” and a conversion example. 


As Figure 3 shows, the Generalized Log Format consists
of four elements. The type of each element is a character
string except for the time element, which is an integer
value in seconds. This format is very similar to the
message format of the syslog daemon on UNIX systems.
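
As a rough illustration of the four-element record, the following Python sketch converts one syslog-style line into a GLF-like tuple. It is not MieLog's conversion tool. The field names (`time`, `tag1`, `tag2`, `message`), the regular expression, the `year` parameter (classic syslog lines carry no year), and the choice of UTC epoch seconds for the time element are all assumptions made for illustration.

```python
import calendar
import re
import time
from typing import NamedTuple


class GLFRecord(NamedTuple):
    """Hypothetical in-memory form of one Generalized Log Format entry."""
    time: int      # integer seconds (assumed here: Unix epoch, UTC)
    tag1: str      # the hostname when the input is a syslog line
    tag2: str      # the program name when the input is a syslog line
    message: str


# Assumed shape of a classic syslog line: "Oct 25 10:55:12 host prog[pid]: text"
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<prog>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def syslog_to_glf(line: str, year: int) -> GLFRecord:
    """Convert one syslog-style line into a GLF-like record."""
    m = SYSLOG_RE.match(line)
    if m is None:
        raise ValueError(f"not a syslog-style line: {line!r}")
    parsed = time.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return GLFRecord(calendar.timegm(parsed), m["host"], m["prog"], m["msg"])
```

A converter for another log type would only need to decide which of its fields play the roles of tag1 and tag2; the rest of the pipeline could then treat all logs alike.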


This module provides a couple of advantages.
MieLog has the ability to browse through log
messages recorded in several log-files at one time
because the conversion of log messages enables the
integration of various computer logs into one. The
integration of logs is based on the recorded time
stamp of each message. These functions reduce the
number of operations and time involved when administrators
have to inspect several logs. Moreover, this
reduces the difficulty of the comprehensive judgment
because an inspector can see the time correlation of
log messages.
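
The time-stamp-based integration can be sketched as a merge of already-sorted streams. The tuple layout `(time, tag1, tag2, message)` and the function name `merge_glf_logs` are assumptions for illustration; the paper does not describe how the merging tool is implemented.

```python
import heapq
from typing import Iterable, Iterator, Tuple

GLF = Tuple[int, str, str, str]  # (time, tag1, tag2, message); assumed layout


def merge_glf_logs(*logs: Iterable[GLF]) -> Iterator[GLF]:
    """Merge several GLF logs, each already sorted by time, into one stream."""
    # heapq.merge consumes the inputs lazily, so large logs need not fit in memory.
    return heapq.merge(*logs, key=lambda rec: rec[0])
```

Because each input is read lazily, this kind of merge scales to logs much larger than memory, provided every converted log is itself in time order.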


The other role of this module is pre-processing 
for statistical analysis in order to extract frequency 
information from the log. 


Frequency Information Extraction Module 


This module extracts frequency information from 
GLF formatted log messages using statistical analysis. 
MieLog uses this information to help extract unusual 
log messages without pre-defined keywords. This 
approach is based on the following concept: even if a 
log has a massive amount of message data, there are 
generally only a few key messages. In other words,
using such information, we can extract at least the
“candidates” of unusual log messages. This assists an
inspector in recognizing anomalous messages even if
administrators have no prior knowledge or experience
about the log.


Takada & Koike


We explain how to extract frequency information 
with respect to each element in the GLF as follows: 

• Frequency information regarding the time.
There are two types of frequency information
extracted from the time element of the GLF.
One is the number of log messages that occur
in each unit of time in a periodic time span.
The other is the number of log messages in
each unit of time for the entire period of the
log.

• Frequency information regarding the tag.
This module counts the number of appearances
of each tag and keeps them sorted in descending
order.

• Frequency information regarding the message.
We focus on a word and a phrase in log
messages as a unit of the analysis. A phrase in
MieLog is defined as a series of two words in a
message. The module counts the number of
appearances of each word and phrase (Figure 4).
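
The time and tag counts described in this list can be sketched as follows. This is an illustrative sketch, not MieLog's code: the record layout `(time, tag1, tag2, message)`, the use of UTC, the one-hour default unit, and counting tags as `(tag1, tag2)` pairs are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


def time_and_tag_frequencies(records, unit_seconds=3600):
    """records: iterable of (time, tag1, tag2, message) tuples (assumed layout)."""
    by_hour_of_day = Counter()   # periodic span: 24 slots
    by_day_of_week = Counter()   # periodic span: 7 slots, Monday == 0
    by_unit = Counter()          # entire period: one bin per unit of time
    tags = Counter()
    for t, tag1, tag2, _msg in records:
        when = datetime.fromtimestamp(t, tz=timezone.utc)
        by_hour_of_day[when.hour] += 1
        by_day_of_week[when.weekday()] += 1
        by_unit[t // unit_seconds] += 1
        tags[(tag1, tag2)] += 1
    # most_common() returns the tags sorted by count in descending order.
    return by_hour_of_day, by_day_of_week, by_unit, tags.most_common()
```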


Message information from GLF formatted log:

    connect from tokyo
    connect from tokyo
    connect from osaka
    connect from kyoto
    connect from osaka
    connect from hakata
    connect from osaka
    connect from osaka
    connect from kyoto
    connect from tokyo
    connect from osaka
    connect from tokyo


Appearance Frequency in Words

    Num. of Appearance    Word
    12                    connect
    12                    from
     5                    osaka
     4                    tokyo
     2                    kyoto
     1                    hakata


Appearance Frequency in Phrases

    Num. of Appearance    Phrase
    12                    connect from
     5                    from osaka
     4                    from tokyo
     2                    from kyoto
     1                    from hakata


Figure 4: Feature extraction of messages.
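
The word and phrase counting illustrated in Figure 4 can be sketched in a few lines. Splitting on whitespace is an assumption (the paper does not say how MieLog tokenizes messages), and the sample data is modeled on the messages in Figure 4.

```python
from collections import Counter


def word_and_phrase_counts(messages):
    """Count single words and two-word phrases over all messages."""
    words, phrases = Counter(), Counter()
    for msg in messages:
        tokens = msg.split()
        words.update(tokens)
        # A phrase is a series of two adjacent words in one message.
        phrases.update(zip(tokens, tokens[1:]))
    return words, phrases


messages = (["connect from tokyo"] * 4 + ["connect from osaka"] * 5
            + ["connect from kyoto"] * 2 + ["connect from hakata"])
words, phrases = word_and_phrase_counts(messages)
```

Sorting either counter in ascending order of count brings rare items such as "hakata" to the front, which is the kind of candidate the frequency approach is meant to surface.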


Using MieLog, it is also possible to extract 
unusual log messages using keywords that an inspec- 
tor already knows. When an inspector defines key- 
words, MieLog highlights these words or phrases 
visually. 


Information Visualization Module 


MieLog visualizes log messages by combining 
three kinds of sources: GLF-formatted log messages, 
frequency information and pre-defined keywords. This 
module also makes MieLog a highly interactive sys- 
tem. MieLog has a variety of interactive functions to 
help extract unusual messages. Visualization provides
the inspectors with the following two advantages. One
is to reduce the load of recognizing a textual message.
The other is that it enables administrators to introduce
human decision making into judging whether each
message seems to be unusual or not.


Visual Representations and Interactive Functions


In this section, we describe the visualization
method and interactive functions of MieLog.


Visual Representation of MieLog


The information visualization module creates a 
visual screen such as the one shown in Figure 1. The
screen of MieLog is composed of four visual areas as 
mentioned in the previous section. We will explain the 
visual representation of each area in this section. 

Tag area 

The tag area visualizes frequency information of 
tags as a vertical grid (Figure 5). Each colored tile in the 
grid represents the information for one corresponding tag. The
number of tiles represents the total number of tags. 


[Figure 5 legend: Blue = maximum output number of a tag;
Red = minimum output number of a tag.
Different tiles represent different tags.
The color of the tiles represents
the number of times the tags appeared.]


Figure 5: Visualization method of tags in log.


The color of the tiles represents the value of frequency
information of each tag. A blue tile indicates
that the corresponding tag has the highest frequency 
value in the log, while a red tile indicates that the cor- 
responding tag has the lowest frequency value. Other 
tiles with intermediate frequency values have interme- 
diate colors between red and blue. This visualization 
makes it possible to understand the number and fre- 
quency of each tag. The name of each tag is displayed 
at the bottom of the screen. 


MieLog also offers another coloring scheme in this
area based on the number of tiles, namely the number
of tags. The color of each tile is evenly gradated from
blue to red. With this scheme it is easier to distinguish
the tiles from one another than with the colors based on
the frequency values.
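
The two coloring schemes can be sketched as simple interpolations. The paper states only the endpoints (blue for the highest frequency, red for the lowest, and an even gradation from blue to red in the second scheme); the linear RGB interpolation and the function names below are assumptions.

```python
def frequency_color(count, min_count, max_count):
    """Return (r, g, b) in 0..1: red for the lowest count, blue for the highest."""
    if max_count == min_count:
        ratio = 1.0  # all tags equally frequent; arbitrary choice in this sketch
    else:
        ratio = (count - min_count) / (max_count - min_count)
    return (1.0 - ratio, 0.0, ratio)


def rank_colors(n_tiles):
    """Evenly gradated colors from blue (first tile) to red (last tile)."""
    if n_tiles == 1:
        return [(0.0, 0.0, 1.0)]
    return [(i / (n_tiles - 1), 0.0, 1.0 - i / (n_tiles - 1))
            for i in range(n_tiles)]
```

With frequency-based colors, tags with similar counts get nearly identical colors; the rank-based scheme spaces the colors evenly, which is why neighboring tiles are easier to tell apart.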


Time area 


The time area is subdivided into three areas. The
right-most column of the time area shows a histogram.
The time is assigned from top to bottom, and the value
axis is assigned from left to right. It shows how many
messages are produced in each unit span. The two
left columns in the time area represent the appearance
frequency information in different periodic time divisions.
As you can see, the left grid has seven tiles and
the right has twenty-four tiles. This indicates that the
left grid represents the appearance frequency information
for each day of the week and the right represents it
for each hour of the day.
The representation method of these grids is the same
as that of the tag area except for the coloring. The coloring
of this area is a gradation between white and
black instead of blue and red (Figure 6).


These visual representations make it easier to
recognize time-characteristics of the log.


[Figure 6 labels: time information; messages.
Time period is an hour in this example.]


Figure 6: Visualization method of time in logs.


Outline area 


This area displays the outline of log messages. 
Each log message is represented as a colored line. The 
length of the line is the string length of the log mes- 
sage. The colors of the lines are the same as the grid’s 
colors in the tag area. In other words, the color of each 
line is assigned to the color of the corresponding tag 
defined in the tag area (Figure 7). 


This visual representation enables administrators
to recognize log messages as a visual pattern based on
the length and the frequency with which they appear.
As a result, it is possible to browse many log messages
at once, unlike textual representations. Since the line
color depends on the appearance frequency of the tag,
administrators can also judge whether each message
appears to be unusual or not. These provide inspectors
with the opportunity to pinpoint unusual log messages
before they read the textual log messages.


[Figure 7 annotation: The outline area visualizes the log messages
as colored lines corresponding to the message length.]


Figure 7: Visualization method of outlines of log messages.


2002 LISA XVI — November 3-8, 2002 — Philadelphia, PA


MieLog: A Highly Interactive Visual Log Browser ...


In the center of this area is a transparent square
that shows the correspondence between the
outline area and the message area. The region of the
highlighted square in the outline area is the section
displayed in the message area.


Message area 


The message area represents actual textual log 
messages. It is a view similar to that of a text editor, 
with the exception that it has highlighted words or 
phrases. Words and phrases are highlighted in red and
blue (Figure 8). The words and phrases highlighted
with red represent the keywords specified in the pre- 
defined keywords. Those highlighted in blue represent 
words with a low appearance frequency value. An 
inspector must define the threshold value manually if 
he or she wants to extract the words and phrases with 
a low appearance frequency value. These features 
make it possible to extract not only known key mes- 
sages but also potentially suspect messages. 


[Figure 8 annotations: These words might be valuable for inspectors.
RED highlights pre-defined keywords.
BLUE highlights words which appear with low frequency.]


Figure 8: Visualization method of log messages and
its features.
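
The red and blue highlighting rule of the message area can be sketched as a per-word classification. The function name, the `<=` comparison against the threshold, and giving keywords priority over low-frequency words are assumptions; the paper states only that keywords are shown in red and low-frequency words in blue.

```python
from collections import Counter


def classify_words(message, word_counts, keywords, threshold):
    """Return [(word, color)] where color is 'red', 'blue', or None."""
    result = []
    for word in message.split():
        if word in keywords:
            color = "red"            # pre-defined keyword
        elif word_counts[word] <= threshold:
            color = "blue"           # low appearance frequency
        else:
            color = None             # drawn as ordinary text
        result.append((word, color))
    return result
```

Raising the threshold marks more words as blue, which matches the later observation that a higher threshold setting produces many more highlighted words.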


Interactive Functions 


MieLog has a variety of interactive functions that 
perform various filtering of log messages. Using these 
functions, administrators can extract log messages 
using various visual transformations. This capability 
effectively assists inspectors by extracting the mes- 
sages that meet a specific pattern. In this section, we 
describe the interactive functions of MieLog. We 
explain them with respect to each visual area. 


Tag area 


An interactive function in the tag area allows the
extraction of log messages with a specific tag. If
administrators want to filter log messages using tag
information, they simply select the tag of interest by
clicking its tile in the grid with the mouse. They then
obtain a new visual screen that displays only the log
messages with the specified tag.


It is possible for inspectors to specify not only
one tag but also multiple tags in the filtering. It is also
possible to specify the tags based on appearance frequency
information.


Time area 


An interactive function in the time area extracts 
the log messages based on their recorded time. There 
are two types of visualization methods in this area: 
grid visualization and histogram visualization. We 
explain the interactive function of each of them 
respectively. 


In grid visualization, an inspector can extract log
messages based on two types of periodic time spans:
the hour of the day and the day of the week.
The method of filtering, as in the tag area, is to click
the tile with the mouse. The administrator can select
multiple tiles in a grid. They can also select multiple
tiles in the two separate grids. In such a case, MieLog
extracts the log messages based on the “AND” condition
between the two time spans. In other words, it is possible to
extract the log messages that were recorded at 18, 19
and 20 o’clock on Saturday and Sunday just by clicking
the five tiles.
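
The grid filter can be sketched as a pair of set-membership tests joined by “AND”. Treating several tiles selected within one grid as alternatives follows from the Saturday/Sunday example; the record layout, the use of UTC, and treating an empty selection as "no restriction" are assumptions.

```python
from datetime import datetime, timezone


def filter_by_time_tiles(records, hours=frozenset(), weekdays=frozenset()):
    """Keep records whose hour and weekday both match the selected tiles.

    records: iterable of (time, tag1, tag2, message) tuples (assumed layout).
    weekdays use Monday == 0, so Saturday is 5 and Sunday is 6.
    """
    kept = []
    for rec in records:
        when = datetime.fromtimestamp(rec[0], tz=timezone.utc)
        hour_ok = not hours or when.hour in hours
        day_ok = not weekdays or when.weekday() in weekdays
        if hour_ok and day_ok:       # "AND" across the two grids
            kept.append(rec)
    return kept
```

The five-tile example then corresponds to `hours={18, 19, 20}` and `weekdays={5, 6}`.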


In histogram visualization, an inspector can
extract the log messages based on the number of log
messages in each time span. The filtering method is
as follows. First, the inspector
selects the type of filtering. There are three filtering
conditions: “less than,” “nearly equal,” and “more
than” a threshold value based on the number of log
messages in each time span. Next, the inspector
defines a threshold value. A vertical line is drawn when
the inspector drags the mouse pointer while holding the
right button in the histogram area. That line represents
the threshold value for filtering, so the inspector can
define the threshold value interactively using the visual
representation. Finally, the inspector releases the right
mouse button to fix the threshold value according to the
location of the mouse pointer. The filtering process
then runs using the threshold value and the
previously defined filtering mode. The visual representation
is updated to reflect the filtered result.


Outline area 


An interactive function in the outline area 
enables inspectors to extract log messages based on 
the length of the log message. 


Whenever the inspector defines a base length for
filtering by manipulating the mouse, he or she obtains a new
visual screen that displays only the messages that
satisfy the selected length condition. There are three
filtering conditions. The
first filtering condition extracts messages shorter than
the base length. The second condition extracts messages
which are nearly equal to the base length. The
third condition extracts messages longer than the base
length. The method of filtering is the same as the filtering
method based on the output number in the histogram
of the time area.
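
Since the histogram filter and the length filter share the same three conditions, both can be sketched with one function that compares a measured value against a base value. The function name, the mode strings, and in particular the `tolerance` parameter are assumptions: the paper does not say how close “nearly equal” must be.

```python
def threshold_filter(items, measure, base, mode, tolerance=0):
    """Keep items whose measure(item) is less than, nearly equal to,
    or more than `base`.

    `tolerance` defines "nearly equal"; it is a hypothetical parameter.
    """
    if mode == "less":
        keep = lambda v: v < base
    elif mode == "nearly_equal":
        keep = lambda v: abs(v - base) <= tolerance
    elif mode == "more":
        keep = lambda v: v > base
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [item for item in items if keep(measure(item))]
```

For the outline area, `measure` would be the message length (for example `len`); for the histogram, it would be the number of messages recorded in the time span to which a message belongs.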


The outline area has another interactive function
which enables direct access to a specific log
message for a detailed look, displaying it in the message
area. The number of visualized log messages is
much greater in the outline area than in the message
area. Therefore, many log messages visualized in the
outline area are not visible in the message area. If
some of the visualized log messages seem unusual,
administrators will naturally want to
see them in detail. This interactive function
helps them to inspect such log messages more easily.


Message area 


An interactive function in the message area 
extracts log messages that include specific words or 
phrases. In other words, it is possible to filter log mes- 
sages by a word or a phrase. The method is described 
as follows. 


First, the inspector chooses either words or
phrases in the menu. This operation decides the
unit element for filtering. Second, a filtering condition
is selected. There are three filtering conditions:
“and,” “or,” and “not.” These filtering conditions
represent the logical relation between selected
words or phrases. Filtering with the “and” condition
extracts the log messages that include all selected
words or phrases. Filtering with the “or” condition
extracts the log messages that include at least one of
the selected words or phrases. These two filtering
conditions become effective when an inspector selects
more than one word or phrase. Filtering with the “not”
condition extracts the log messages that do not include
the selected words or phrases. An inspector can also
use this filtering condition when he or she
selects only one word or phrase.


The “not” filtering is extremely useful for reducing
the number of visualized messages because it enables
the inspector to remove some messages from the inspection
target. Using this filtering, administrators can filter
out the well-known (i.e., useless) log messages step
by step. This function helps the administrator narrow
the inspection target easily and interactively.
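
The three word and phrase filtering conditions can be sketched as follows. Whitespace tokenization, the helper name `contains`, and the reading of “not” with several selected terms as "none of them appears" are assumptions made for this sketch.

```python
def contains(tokens, term):
    """True if the words of `term` occur consecutively in `tokens`."""
    want = term.split()
    return any(tokens[i:i + len(want)] == want
               for i in range(len(tokens) - len(want) + 1))


def filter_messages(messages, terms, condition):
    """condition is 'and', 'or', or 'not' over the selected words or phrases."""
    kept = []
    for msg in messages:
        tokens = msg.split()
        hits = [contains(tokens, term) for term in terms]
        if condition == "and":
            ok = all(hits)
        elif condition == "or":
            ok = any(hits)
        elif condition == "not":
            ok = not any(hits)   # assumed: none of the selected terms may appear
        else:
            raise ValueError(f"unknown condition: {condition!r}")
        if ok:
            kept.append(msg)
    return kept
```

Applying the “not” filter repeatedly, each time with another well-known message pattern, reproduces the step-by-step narrowing described above.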


There are other kinds of interactive functions that
are not tied to a particular visual area.


Defining Keywords and Key Phrases 


When the inspector defines keywords or key
phrases that are already known to indicate unusual log
messages, MieLog highlights them so that the
inspector can easily spot them.
The definition of keywords and key phrases is usually
done using another tool such as a text editor. It is,
however, possible to define them using MieLog itself
through the GUI.


There are two methods of defining keywords or
key phrases in MieLog. One is to input them directly through
the GUI. The other method is as follows.
First, the inspector selects a word or a phrase as a keyword
or a key phrase on the screen of MieLog using
the mouse. Next, he or she starts the GUI for
keyword definition. The selected words or phrases
are then automatically entered in the GUI. If the inspector
pushes the “OK” button on the GUI, they are defined
as keywords or key phrases. The latter
method thus provides a convenient way for inspectors
to input keywords or key phrases.


[Figure 9 annotations:
Continuous output of a certain amount of log messages.
There are hardly any log messages for a time.
These two tiles show that many log messages
were output on Thursday and at 17 o’clock.
Spikes of log messages were output at certain time periods.]


Figure 9: An example of an investigation focusing on
time.


Summarizing Log Messages 

MieLog has a log summarization function that
eliminates the duplication of log messages. Computer
log-files generally contain massive amounts of messages.
One of the reasons is that relatively unimportant
messages are recorded repeatedly in the log. These
messages are usually the result of normal events in the
operating system or applications. MieLog, therefore,
enables the inspector to summarize log messages
interactively in order to reduce the inspection target. If
the inspector executes this function, all visualized log
messages in MieLog become unique. This makes it
possible to reduce the number of log messages and
avoid redundant log inspection.
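
Summarization can be sketched as order-preserving de-duplication. The paper does not say which elements must match for two messages to count as duplicates; this sketch assumes that records with the same tags and message text are duplicates regardless of their time element, and that the first occurrence is the one kept.

```python
def summarize(records):
    """Drop repeated messages, keeping the first occurrence of each.

    records: iterable of (time, tag1, tag2, message) tuples (assumed layout).
    """
    seen = set()
    unique = []
    for rec in records:
        key = rec[1:]            # (tag1, tag2, message); time is ignored
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```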


Log Inspection Examples Using MieLog 


In this section, we show three examples which
demonstrate how to use MieLog to find log messages that
seem to indicate abnormal behavior. We also
explain how to use the interactive functions for effective
browsing in each case.


An Inspection Example using Log Recording Time 
Visualization 


Figure 9 shows an example of visual representation
of the time area.
There are two grids in the time area. Each grid
has one bright white tile. This visualization shows that
many log messages were recorded at a certain hour of
the day and a certain day of the week. In this example,
these are 17 o’clock and Thursday. This is a notable
indication to help find an unusual log message. Such
an indication would be regarded as a sign of an abnormal
event in many cases. The administrators,
therefore, should inspect the log messages
recorded in that period of time. It is easy for them to
view only the log messages recorded in those periods
of time if they make use of the interactive log message
filtering by time. They would simply click the
two white tiles with the mouse.


Next, we look at the visualization of an output 
trend as a histogram in the time area. From the exam- 
ple diagram, it is possible to recognize certain notable 
activity as listed below: 

1. Up until a certain time, a regular number of log 
messages were continuously recorded in each 
time span. 

2. After this period, no log messages were 
recorded. This situation continued for a while. 

3. Some time later, there are two time spans that 
recorded large spikes of log messages. 

It is possible for the inspectors to recognize all of 
these indications without actually reading through the 
log messages. 


Based on the above indications, we suggest three
specific periods for which the administrator should
inspect the log in further detail. One is the time
when message output stopped. The others are the two
time spans when a large number of log messages were
recorded.


It is easy to perform these inspections using the
interactive function of MieLog. Administrators can
easily access the log messages that were recorded in a
specific time span just by clicking the lines in the histogram
of the time area. They will get a new visual
screen that shows the messages recorded in the specific
time span only.


In this example, as a result of the above indications,
the inspector was able to determine the following
things. The reason why log messages ceased after
a certain time is that a system program did not start
after the system configuration was modified. The
massive number of log messages generated in specific
time spans was the result of another administrator
testing new software. In this case,
neither indication resulted from an abnormal
event.


We have shown an example in which MieLog
was able to present various indications to the inspector
just by focusing on the time area. These indications
helped the inspector find unusual messages
through time trends and frequency, which would have
been almost impossible to discover by reading through
conventional text.


An Inspection Example using Log Outline Visualization


Next, we focus on the visual representation of
the outline area. Figure 10 shows three visual representations
of the outline area. The left visualization
appears to show a normal status. The
other visualizations seem to indicate an abnormality.


Focusing on the left visualization, three
characteristics can be recognized:

1. Most of the log messages have nearly the same
length.

2. The same series of log messages is output
towards the middle and the bottom of the
visualization.

3. There is a message that is clearly longer than
the others towards the bottom of the screen.


Figure 10: An example of an investigation focusing on log message outlines.


An administrator should suspect the latter two of
the three as indications of unusual log messages and
inspect them in detail. To look at the log messages in
detail, the inspector clicks the line in the outline area
with the mouse, and the text appears in the message
area. If the inspector decides there is something
unusual about the messages, he or she needs to investigate
the log from the following points of view:

• Was the series of messages recorded at
regular intervals?

• At what time did these messages begin to be
recorded?

• Are there any other unusual messages around
the time when these messages started to be
recorded?


It is also easy to answer the above questions
using interactive functions. The inspector can extract
the series of messages using word filtering. The
inspector can then easily get the output trend and periodicity
of such messages from the time area. He or
she can, of course, easily access each message pattern
and its surrounding messages. These functions help
the inspector look for unusual messages around that
time period.


Next, we look at the center image in Figure 10.
This figure is clearly different from the log messages
outlined in the other two visualizations. The inspector,
therefore, easily recognizes an abnormal status at a
glance. The reason is that the log messages in this outline
example have lines of various lengths colored
red. The inspector knows that these particular log
messages rarely appear in this log-file. The inspector
should investigate them in further detail.


We finally focus on the right image in Figure 10.
This visualization contains many lines with the same
blue color and the same length towards the bottom.
This is a clearly unusual status. The line colors are
blue and therefore appear to indicate a normal status.


We think, however, that the line
colors are blue because the repeated output of the
same message makes its appearance frequency high.
The inspector should investigate these messages in
further detail.


As seen in these examples, outline visualization
enables the inspector to recognize the log messages as
a pattern. This feature provides indications to find an
unusual message before reading the textual messages.
In other words, outline visualization provides another
method for extracting unusual log messages other than
the frequency information data. The above examples
give a glimpse of such ability. This capability greatly
depends on using information visualization and introducing
a human decision into the judgment.


An Inspection Example Using Log Message Representation
with Word and Phrase Highlighting


Finally, we focus on the visual representation of the
message area. MieLog represents log messages as text.
However, more than just text, MieLog uses highlighting
features, which make it possible to quickly recognize
words and phrases with low appearance frequency.


Figure 11: An example of an investigation focusing on log messages.


Figure 11 gives three examples of visual repre- 
sentations of the message area. The difference 
between these visualizations is the method of high- 
lighting the words and phrases for indicating the sus- 
picious log messages. The left visualization is an 
example using keyword extraction only. There is one 
word highlighted in red. It means that the word is one 
of the keywords pre-defined by the inspector. 


The other two visualizations are examples using
keyword and frequency extraction. There are some
words colored blue, which means that they have a low
appearance frequency. In the center visualization of
Figure 11, the inspector sets the threshold value for low
frequency extraction to low. In the right visualization,
the inspector sets the threshold value to middle. There
are, of course, many more words highlighted in blue
in the right visualization than in the center.


If the inspector wants to see the low frequency
words, he or she must define the threshold value first.
In the current implementation, the inspector must
select the threshold value from fixed values using a
pop-up menu. Then, the inspector clearly recognizes the
words and the phrases with low appearance frequency
because they are highlighted in blue.


We believe that indicating low appearance frequency words
and phrases helps administrators find unusual log
messages, because such words and phrases are probably
related to unusual log messages. Visual representations
reduce the chance of overlooking them in manual log
inspections and make it easier to see where low frequency
words and phrases appear.
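
The keyword and frequency highlighting described in this section can be sketched in a few lines of Python. This is a minimal illustration, not MieLog's implementation: the tokenizer, the function names, and the numeric thresholds attached to the “low” and “middle” settings are all assumptions made for the example, since the paper does not give them.

```python
import re
from collections import Counter

# Hypothetical mapping from the pop-up menu setting to a maximum relative
# frequency. The values MieLog actually uses are not stated in the paper.
THRESHOLDS = {"low": 0.02, "middle": 0.10}


def tokenize(message):
    """Split a log message into lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_.-]*", message.lower())


def classify_words(messages, keywords, setting="low"):
    """Return {word: color} for every word that should be highlighted.

    "red" marks inspector-defined keywords; "blue" marks words whose
    relative frequency is at or below the selected threshold.
    """
    counts = Counter(word for msg in messages for word in tokenize(msg))
    total = sum(counts.values())
    limit = THRESHOLDS[setting]
    colors = {}
    for word, n in counts.items():
        if word in keywords:
            colors[word] = "red"
        elif total and n / total <= limit:
            colors[word] = "blue"
    return colors
```

Raising the setting from “low” to “middle” can only add blue words, which matches the difference between the center and right visualizations.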


Discussion 


We describe the advantages and proposed future 
work of MieLog in this section. 


Advantages of MieLog 


MieLog provides administrators with various 
tools for inspection using information visualization 
and statistical analysis. They help administrators to 
look for unusual log messages. MieLog also makes it 
possible to inspect logs interactively. The advantages 
of MieLog are as follows: 

• Logs in various formats simultaneously. The data
  analyzed in MieLog consists of logs that have been
  converted into an intermediate format. A log must be
  converted before it can be inspected with MieLog, so
  any log can be inspected as long as a conversion
  process is provided for it. The intermediate format
  also enables the inspector to inspect more than one
  log at a time. In other words, multiple logs can be
  integrated into one, based on the recorded time. This
  reduces the time and operations involved in inspecting
  multiple logs. Moreover, such log integration shows the
  relationship between messages that were recorded in
  separate log files.

• Methods which assist in finding unusual log messages.
  MieLog extracts appearance frequency information from
  the log and presents it in various graphic
  visualizations. This information provides various
  indications of abnormal events that may be almost
  impossible for administrators to find by reading the
  textual records. The visualizations also allow logs to
  be inspected from a global point of view. The inspector
  can, for example, decide whether a log message is
  unusual based on the number of log messages from a
  specific program. No prior operations or knowledge are
  needed in this process, so even an inspector with no
  knowledge or experience can make use of this advantage.

• Visual Representation and Interactive Functions. Even
  if abnormal indications are extracted through
  statistical analysis, they are of no use if the
  inspector cannot recognize them. To resolve this
  problem, MieLog represents these indications visually
  so that they can be recognized easily and quickly.
  Information visualization also makes it possible to
  interact with the visualized information directly.
  Using the interactive functions, the inspector can
  easily and intuitively extract the log messages that
  fit a specific condition. We think that this helps to
  bring the human decision making process into the log
  inspection task.
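
The first two advantages, integrating several logs by recorded time and taking a global view such as the number of messages per program, can be illustrated with a short Python sketch. It assumes classic syslog lines of the form `Mon DD HH:MM:SS host program[pid]: text` and uses a plain dictionary as a stand-in for the intermediate format, whose real layout is not described here; the function names and the default year are likewise assumptions.

```python
import heapq
from collections import Counter
from datetime import datetime


def parse_line(line, source, year=2002):
    """Parse one syslog-style line into a record (assumed record layout).

    Classic syslog timestamps carry no year, so one is supplied.
    """
    stamp, rest = line[:15], line[16:]
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    host, _, rest = rest.partition(" ")
    tag, _, text = rest.partition(": ")
    program = tag.split("[")[0]          # drop the "[pid]" suffix if present
    return {"time": when, "source": source, "host": host,
            "program": program, "text": text}


def merge_logs(logs):
    """Merge several logs into one stream ordered by recorded time.

    `logs` maps a source name to a list of raw lines, each list already
    in chronological order, as log files normally are.
    """
    streams = [
        [parse_line(line, source) for line in lines]
        for source, lines in logs.items()
    ]
    return list(heapq.merge(*streams, key=lambda rec: rec["time"]))


def messages_per_program(records):
    """Count messages by originating program, a simple global view."""
    return Counter(rec["program"] for rec in records)
```

Because every record keeps its `source`, the merged stream still shows which file each message came from, which is what makes relationships between separate log files visible.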


These advantages can be summarized in another way.


The greatest advantage of MieLog is that it provides a
method of anomaly detection for the manual log inspection
task. Anomaly detection is a well-known technique for
detecting intrusive behavior in intrusion detection
research. There is, however, no log inspection system
that makes use of this technique. Other log inspection
systems make use of misuse detection only, such as
keyword search. Information visualization and statistical
analysis make it possible to apply anomaly detection when
inspecting computer logs manually. From this point of
view, MieLog differs greatly from other log inspection
tools.


Another advantage of MieLog is that it is a
human-centered system. MieLog is just a log browser, not
an automated log inspection tool, and information
visualization makes the log content easier to recognize
than a textual representation does. We believe that there
are still many system administrators who want to inspect
logs by themselves. There is, however, no system whose
functions meet their needs. We believe that MieLog is a
tool that has a variety of functions to meet them.


Future Work


We describe proposed future enhancements of MieLog in
this section. There are several areas for improvement.


The first issue is the log message extraction method. We
use appearance frequency information to extract unusual
log messages. However, not all extracted log messages are
actually unusual. We must evaluate how well the extracted
messages correspond to unusual log messages and refine
the extraction method. For example, many numerical values
in the log are extracted as low frequency words because
they are not recorded repeatedly. Therefore, we should
exclude them as targets of statistical analysis.
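
One way to keep numerical values out of the statistics is to filter tokens before counting them. The Python sketch below is only an assumption about how such a filter might look; the pattern, the punctuation that is stripped, and the function name are all illustrative.

```python
import re
from collections import Counter

# Tokens treated as numerical values: plain integers, dotted numbers such
# as decimals or IPv4 addresses, hexadecimal literals, and bracketed
# process IDs such as [1234].
NUMERIC = re.compile(r"^\[?(0x[0-9a-f]+|\d+(\.\d+)*)\]?$", re.IGNORECASE)


def word_counts(messages, skip_numeric=True):
    """Count words across messages, optionally ignoring numeric tokens."""
    counts = Counter()
    for message in messages:
        for token in message.split():
            token = token.strip(",;:()")
            if not token or (skip_numeric and NUMERIC.match(token)):
                continue
            counts[token.lower()] += 1
    return counts
```

With the filter enabled, values such as port numbers and addresses, which are rarely repeated, no longer appear as low frequency words.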


The next issue is performance as the size of the logs
grows. The modules mainly affected by log size are the
statistical analysis and the interactive functions. As
the log grows, the interactive functions take more time
to respond. To alleviate this problem, we think it would
be better to separate the statistical analysis from the
graphical browser. We are also considering using a
database.


The last issue is adding a real-time log monitoring
feature, which raises a number of problems. We think that
we must substantially modify the system design in order
to implement this feature. We also think that we should
prepare another visual representation for real-time log
monitoring. One reason is that the current representation
visualizes only a small number of log messages at once.
For real-time monitoring, MieLog would need to visualize
many more log messages than it does now, because a large
number of log messages can easily be output at the same
moment. If that happened with the current representation,
administrators would lose the chance to see unusual log
messages.


In addition, we must prepare log conversion programs for
various types of logs in order to capitalize on the
advantages of MieLog. We currently provide log conversion
only for UNIX syslog formats.


Related Work


There are a number of log inspection tools 
already in existence which can be compared with 
MieLog. In this section, we describe two typical log 
browsing and inspection systems, and explain how 
MieLog differs from these types of systems. 


One system is “Xlogmaster” [9]. This is a GUI based log
monitoring system running on the X Window System. The
main problem of Xlogmaster is that it represents log
messages as text, which makes it harder for inspectors to
recognize the log messages. Moreover, it is almost
impossible to determine characteristics of the log, such
as a message output trend. The inspector must define
keywords in order to extract the unusual log messages.
Administrators who have no knowledge or experience of
log inspection will have difficulty extracting unusual
log messages.


The other system is “SeeLog” [1]. SeeLog, like MieLog,
represents log messages visually. However, the main
problem with SeeLog is that only an outline visualization
is available. It is thus difficult to ascertain
characteristics of the log. Although SeeLog enables a
visual representation of textual log messages, it is
difficult for administrators to browse through them. It
also has the problem of requiring keyword definitions in
order to extract unusual log messages.


There are other log monitoring tools such as 
Swatch [5], Logsurfer [7] and syslog-ng [8]. However, 
a novice administrator will have difficulty using these 
tools effectively because the method of unusual log 
message extraction in these tools is by keyword 
search. 


There are a number of problems with using keyword search
as the inherent method of extraction. The first problem
is that the inspector must define the keywords. This is
difficult for some administrators, because not all
administrators are aware of the keywords for unusual log
messages. The second problem is that it is almost
impossible to extract unusual log messages that are not
widely known. The third problem is that the extracted log
messages are represented as text, so the recognition load
of the log messages remains. The last problem is that
these systems do not support inspection of the messages
around the suspect message. Administrators must manually
inspect the log messages recorded around the time of the
extracted message in order to find other related unusual
log messages. They might also have to search other log
files for the same purpose.


Conclusion 


In this paper, we have described the interactive 
visual log browser, named MieLog. MieLog assists 
human inspection of computer log data. MieLog has 
three main features which address some problems of 
log inspection tasks: 

1. MieLog reduces the recognition load of the log 
by using information visualization. 

2. MieLog’s General Log Format allows the 
administrator to inspect various kinds of logs at 
one time. 

3. MieLog uses statistical analysis to extract vari- 
ous indications that might closely relate to 
unusual log messages. 


These features provide the following merits. The
inspector can inspect more than one log at a time. It is
also possible to find unusual log messages even if the
inspector has no prior knowledge about them. The most
important merit of MieLog is that it brings the human
decision making process into the log inspection task.


Future work on MieLog includes its evaluation in a
practical environment, and the refinement and extension
of message extraction methods and interactive functions.


Availability and Requirements 


Regrettably, MieLog is not freely available because we
plan to make it a commercial product. However, we might
release a limited version of MieLog in the future so that
we can evaluate it and collect opinions about it.


Please feel free to contact the author by E-mail to 
zetaka@computer.org for the current status of MieLog 
or any related information. 


ABSTRACT


The successful operation of a large scale enterprise information system relies, in part, on the 
regular and successful completion of many different tasks. Some of these tasks may be fully 
automated, while others are done manually. One of the challenges we face is detecting when one 
of these tasks fails (often silently) or is forgotten. While you will eventually learn of these 
omissions, it is much better to have the system detect them rather than your users! This paper 
discusses how we implemented a system that watches what we do and reminds us when we (or our
computers) forgot to do something.


Introduction 


Inspector Gregory: “Is there any other point to 
which you would wish to draw my attention?” 

Holmes: “To the curious incident of the dog in 
the night-time.” 

“The dog did nothing in the night-time.” 

“That was the curious incident,” remarked Sher- 
lock Holmes. 

From The Adventure of Silver Blaze by Arthur 
Conan Doyle. 


At Rensselaer, we manage many of our system and site
administration tasks¹ with an Oracle database. For
example, we take a data feed from Human Resources to
automatically create and expire Unix/email and Windows
2000/Exchange accounts. One aspect of this is that we
have many tasks, some run via cron and other scheduling
mechanisms, and others run by hand on a regular basis.
These tasks generate configuration files [7], [3], web
pages (phone directory) [5], process accounting and
billing records, update the Active Directory server, and
do many other things.

One of the problems that we face is knowing when
something that is supposed to happen did not. This may be
due to a transient file server failure, configuration
problems, the failure of a daemon, or simply someone
forgetting to do some periodic, yet infrequent task.
There are a number of monitoring and logging tools
available, from those built into systems, such as syslog,
to programs that help process logs, such as Swatch [10].
There are also tools that monitor network traffic and
system activity, such as Peep [9] and others. In general,
however, all of these systems are looking for things that
are happening but, like Sherlock Holmes, we are
interested in those things that did not happen. An
earlier project to monitor workstation usage patterns [4]
briefly discussed detecting failed workstations by a lack
of usage data, but this was not pursued. Some other
projects to measure system performance via statistical
analysis [11, 2] don’t really apply to very low frequency
events.


¹ Originally, I was calling tasks “processes,” but this
was causing some confusion on the part of some readers
with Unix (or other system) processes. There are still
many references to “process” in table and function names,
but the intention is to refer to a “task.”


One of the things that I wanted to avoid was 
writing and maintaining lots of configuration files. 
Instead, I wanted tasks to report in to a central server 
when they completed successfully, and then have a 
nice interface to identify new tasks and quickly set the 
frequency at which they should recur. After that, I
don’t want to have to think about that particular task 
again. Given our heavy use of Oracle in maintaining 
our system, and that many of the things I was inter- 
ested in monitoring were already accessing Oracle, an 
obvious approach for me was to use the database for 
all of the heavy lifting. 


Task Monitor 


With that, the Task Monitor project was born. When I use
the term “process,” I am not referring to a Unix process,
but rather a specific task such as “load printer
accounting records,” “update online directory files,”
“propagate password changes to Windows” [8], etc. These
may actually be an Oracle job, or part of a job, or a
script run out of cron, or even something running on a
Windows server.


The information on a task is stored in an Oracle table
with the name Process_Monitor. The description of this
table is broken up into several parts and is included in
the appropriate section of the paper. In Figure 1, we
have the rough architecture of the Task Monitor system.
At the center of things is the Task_Monitor package,
which acts as an interface between the different tasks
and the Process_Monitor database table (labeled
Task_Monitor in the diagram). Tasks communicate via a
number of different methods. Some, such as the Student
Upd package, are running on the Oracle server and
communicate directly. Others, such as the Generate_File
based modules, connect via SQL*NET. We may also add
other interfaces such as syslog or SNMP.





Figure 1: Task monitor architecture. (Surviving diagram
label: Generate File Program.)


We also have different ways of getting information out
of the Task Monitor system. A program on the database
machine generates email notifications and sends them to
interested parties. We also connect via a secure web
server for administrative purposes.


Administrative Interface 


One of the key parts of this system is the admin- 
istrative interface, which allows us to set the options 
for each task. This is implemented via a secure web 
server. 


In Figure two, we have a screen capture of the main web
page for the Task Monitor system. This allows you to
display different sets of tasks. You select the things
you are looking for, and press the “LIST” button. All of
the attributes are combined, so the more you select, the
more restricted the selection. The first option is to
find late or “not late” tasks. Next, you can select the
family from the pull down list. There is also a special
“NONE” entry that will limit the results to those tasks
that are not in a family. You can also select based on
those tasks with or without a run delta or schedule, and
those tasks that are marked as inactive. Finally, you can
restrict the tasks to just those owned by you. (This is
run as an administrator; less privileged users will just
get their own tasks.) A sample of this list can be seen
in Figure seven.


Figure 2: Main web page.


In Figure three, we have a sample web page of the
“Logins-Oracle IDs” task. The objective of this task is
to create Oracle accounts based on changes in the Logins
table. This task is considered part of the “daily run,” a
set of activities performed by our User Services staff.
From this page, we could move the task to another family
using the pull down list, or create a new family by
entering the name in the “New Family” box. (This box does
not appear if you are not an administrator; you are
limited to existing families.) In this case, we don’t
care what system this task runs on, only that it is run;
so we leave the “System” box empty. The person who
normally does this task is Judy Shea, so we have listed
her as the owner. If we wanted to let other folks know if
this task was late, we could provide a list of email
addresses in the “Contact List” box. The next thing we
can specify is the “Run Delta,” which is specified as
DAYS HOURS:MINS:SEC. In this case, we want this to be run
every 28 hours (one day and four hours). This gives Judy
a little bit of flexibility in when she does the actual
run. The “Notify Delta” is specified like the “Run Delta”
and controls how frequently we report a missed task.
Lastly, we can mark a task as inactive, which turns off
all notification.
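
As an illustration of the DAYS HOURS:MINS:SEC notation, the following Python sketch converts such a value to seconds, the unit in which the Run_Delta column is described as being stored. The function names are hypothetical, and treating the DAYS field as optional is an assumption; the web tool's real parsing code is not shown in this paper.

```python
import re

# DAYS is treated as optional here; that is an assumption of this sketch.
_DELTA = re.compile(r"^\s*(?:(\d+)\s+)?(\d+):(\d{1,2}):(\d{1,2})\s*$")


def parse_delta(text):
    """Convert a 'DAYS HOURS:MINS:SEC' string to a number of seconds."""
    match = _DELTA.match(text)
    if not match:
        raise ValueError(f"not a DAYS HOURS:MINS:SEC value: {text!r}")
    days, hours, mins, secs = (int(g or 0) for g in match.groups())
    if mins > 59 or secs > 59:
        raise ValueError(f"minutes and seconds must be below 60: {text!r}")
    return ((days * 24 + hours) * 60 + mins) * 60 + secs


def format_delta(seconds):
    """Render a number of seconds back into 'DAYS HH:MM:SS' form."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    mins, secs = divmod(rest, 60)
    return f"{days} {hours:02d}:{mins:02d}:{secs:02d}"
```

Under these assumptions, the 28 hour run delta from the example would be entered as `1 04:00:00`, which corresponds to 100800 seconds.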


The next part of the page reports on information
collected at the last run. The “Next Run” is the time and
date when we next expect this task to be run. If that
time had passed, this would be in bold face and be marked
as late. (This is only set if there is a “Run Delta”
set.) The “Last Run,” which is always available, lists
the time and date of the most recent run. Currently, the
only way to get a task into the system is to run it, so
there is always a “Last Run” entry. Next up is when we
last notified someone about a late process, and when we
expect to send the next notification (assuming a run has
not been completed). There is also space for a free
format comment on the task.


When a run is recorded, the system attempts to 
capture the host OS username and the hostname. 
There is also an option when recording a task to 
include a comment; this is task specific. We also 
record the Oracle user and what Oracle package made 
the call. 


Identifying a Task 


A lot of the tasks we are interested in monitoring are
site wide. For example, the process that regenerates the
directory web pages only needs to run on a single system
and write a file into our central file server. There is
no need to run it on each of our production web servers,
as they all use the same central file server. In other
cases, however, we want to be sure that each server is
reporting in. An example of this would be the process
that collects the printer accounting logs. We want to
ensure that each print server is reporting its activity
on a regular basis. It may also be useful to identify the
user that is running the process.


Figure 3: Sample task web page.


For the purposes of this project we identify a pro- 
cess by a process name and the system on which it 
runs. In this way, if the same process runs on two dif- 
ferent machines, it will be considered two different 
tasks for the purposes of monitoring. This has not been 
a problem with the site wide tasks, as they are generally 
run via cron or some other trigger on a single desig- 
nated machine and by the same user (daemon or equiva- 
lent). Since most of the testing and development takes 
place on a different machine, this isolates the develop- 
ment and test runs from the production runs. We also 
record the name of the person who ran the task, but this 
is currently not used to distinguish tasks. 


Parenting a Process 


In order to help with sorting and grouping, each process
can be assigned to a “Family.” These are general
categories such as “Accounting,” “Daily Run,” “File Gen,”
etc. When a new process is entered, it will not have a
family assigned to it. This works well, as the
administrative web tool can display all tasks in a
family, or those without a family. This provides a quick
and easy way to identify new tasks.


When we encounter a new process, we assign it an owner
and a family. We can also link it to a service in our
ServiceTrak [6] so that the page displaying service
information can also include details on some of these
tasks. Once a process has an owner, that owner can use
the same web tool to finish the setup by assigning a run
delta or schedule, or just marking it as inactive.


Reporting a Task


The first challenge of this project was to find ways for
tasks to report that they ran. When a process reports in
via one of the methods described below, we first see if
we have an existing process record. If not, we create a
new one; otherwise, we update some of the information
and, if there is a run schedule or delta, we then
calculate the next run time and save the record. If there
were previous error conditions, we clear them as well.


                       Type      Size  Description
Entry_Id               Number          A unique key to identify this record.
Name                   varchar2  32    The name or external identifier for this
                                       process.
System_Id              Number          The unique identifier of this system in the
                                       hostmaster and service database.
Run_Host               varchar2  128   The hostname where this last ran. Useful
                                       when the System_Id can not be determined.
Run_User               varchar2  32    The name of the host system user who ran
                                       the process (if available).

Table 1: Process_Monitor table — identification.


                       Type      Size  Description
Family                 varchar2        An identifier used to group tasks for
                                       display and reporting.
Owner                  Number          The internal identifier of the person who
                                       “owns” this process.
Service_Id             Number          The internal identifier of the service (see
                                       ServiceTrak) that this process supports.
Run_Delta              Number          The maximum allowable time in seconds
                                       between the last run and the next run of
                                       this process.
Run_Schedule           varchar2        A crontab format schedule.
Inactive               varchar2        A flag indicating that the current entry is
                                       inactive and should be ignored.

Table 2: Process_Monitor table — parenting.


                       Type      Size  Description
Oracle_User            varchar2  32    The name of the oracle user. This is always
                                       available.
Proc_Name              varchar2  65    The name of the oracle procedure that
                                       logged this run.
Last_Run_Time          Date            The time and date when this process last
                                       ran.
Next_To_Last_Run_Time  Date            The previous value. Useful in calculating
                                       the spacing between runs.
Next_Run_Time          Date            The date and time when we next expect to
                                       see this run. This is the key trigger for
                                       notification.
Run_Comment            varchar2  255   An optional comment set by the caller that
                                       will be displayed in status messages.
                                       Unlike the Error_Flag, this does not
                                       trigger notifications.

Table 3: Process_Monitor table — reporting.


Direct PL/SQL Procedure Call


A number of the tasks that we want to watch are written
entirely in PL/SQL and are run on the database
machine. For example, we have a routine that we run daily
to compare the Simon Banner_Students table with the
student base table (SGBSTDN) on our administrative
machine. This is typical of many similar routines, and it
has two optional parameters: a Target_PIDM, which allows
us to update a record for a specific person, and a
StopCount, which stops the update after a set number of
records (this is handy for debugging). When neither
parameter is set, we want to record the fact that the
routine ran to completion.


In Figure four, we have a code segment of the 
routine that checks the student base table on Banner 
(our student record system) and updates the Simon 
student table. The two cursors are written so that if 
they are opened with a value for the PIDM, they will 
return just the single record for that person; otherwise, 
they will return a full set of records. There is also an 
option to stop after a set number of rows. Once the 
loop is complete, and if we did not exit due to the stop 
count, and we were not doing the check on behalf of a 
specific individual, we will record this run using the 
Process_Monitor_Record.Mark_Proc procedure. This pro- 
cedure will obtain the user, hostname, and other infor- 
mation from the database environment. The only thing 
we need to give it is the task name (Target_Name) and 
the name of the current package. In this way, when- 
ever we do the general update, it will record the fact 
that it ran; it doesn’t matter how we did it. 
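
The bookkeeping described under “Reporting a Task” (create the record on first sight, otherwise update it, recompute the next run time, and clear earlier errors) can be sketched independently of Oracle. The Python below is a hypothetical model, not the Mark_Proc implementation: the field names follow the Process_Monitor tables, but the in-memory registry, the `mark_run` function, and the omission of crontab style schedules are simplifications made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class TaskRecord:
    name: str
    system: Optional[str]
    last_run_time: datetime
    next_to_last_run_time: Optional[datetime] = None
    next_run_time: Optional[datetime] = None
    run_delta: Optional[int] = None   # seconds; set later by the owner
    family: Optional[str] = None      # unset marks a newly seen task
    run_comment: Optional[str] = None
    error_flag: Optional[str] = None


# A task is identified by its name and the system on which it runs.
Registry = Dict[Tuple[str, Optional[str]], TaskRecord]


def mark_run(registry, name, system, when, comment=None):
    """Record a completed run, creating the task record on first sight."""
    key = (name, system)
    rec = registry.get(key)
    if rec is None:
        rec = registry[key] = TaskRecord(name, system, last_run_time=when)
    else:
        rec.next_to_last_run_time = rec.last_run_time
        rec.last_run_time = when
    rec.run_comment = comment
    rec.error_flag = None             # clear previous error conditions
    if rec.run_delta is not None:
        rec.next_run_time = when + timedelta(seconds=rec.run_delta)
    return rec
```

A record created this way has no family and no run delta, which is how new tasks would show up in the administrative tool until an owner finishes the setup.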
Generate_File Definition 

A number of the tasks that we want to watch are run via
our Generate_File system (described in the LISA 2000
Proceedings). These tasks are generally reading or
writing files, using stored procedures in the database.
When we develop a new file target, we store the PL/SQL
source code in a file, and read that into the database
using SQL*PLUS.² At the end of this file, we include a
block of PL/SQL to register the new targets with the
system. This is done with a procedure called
Add_Target_Simple or Add_Target_Complex.³


In Figure five, we have a fragment of the file used to
generate some web pages documenting our network routers
and subnets. In the package, we define some entry points
that will be called by the Generate_File system. At the
end of the sample, we register three things: a simple
target (web_routers) that will call the Get_Router_Html
routine to generate a list of our routers into the
primary_routers.html file; a complex target that will
generate a set of files based on the Get_Network_List
routine; and, finally, a special target,
Add_Process_Record, that will record the fact that the
first two entries have been executed and have completed.
This last routine makes it trivial to record the
completion of any Generate_File run by simply adding
Generate_File.Add_Process_Record to the end of the
registration statements.⁴ This has the added advantage
that we do not have to modify the source code, as a
direct PL/SQL call would require.


The Add_Process_Record routine actually calls the
Add_Target_Simple routine to register a special target
that just records the fact it was called, and exits after
writing a few lines to stdout. Since it is called from
within the Generate_File environment, it can get all of
the information it needs for recording from that, and we
don’t need to pass in any parameters. It also prepends
GENERATE_FILE- to the target name to come up with the
name to record.


² SQL*PLUS is a command line interface to Oracle. One of
the options is to read from a file and pass the
information to Oracle for processing.

³ Since the original paper, we have added several other
kinds of targets in addition to these.

⁴ The Generate_File registration routines will default to
the target name specified in earlier calls if not
provided on later calls.


    procedure Sgbstdn_Full(Target_Pidm in number,
                           Stop_Count  in number)
    is
      Banner   Sgbstdn_Scan_Curs%RowType;
      Simon    Simon_Scan_Curs%RowType;
      Act_Cnt  number := 0;
    begin
      Open Sgbstdn_Scan_Curs(Target_Pidm);   -- Full scan if NULL
      Open Simon_Scan_Curs(Target_Pidm);
      loop
        -- (Details of processing omitted)
        exit when Act_Cnt > Stop_Count;
      end loop;
      close Sgbstdn_Scan_Curs;
      close Simon_Scan_Curs;

      if Act_Cnt > Stop_Count
      then
        dbms_output.put_line('Stopped due to stop_count');
      elsif Target_Pidm is null
      then
        -- Only a complete, general run is recorded.
        Process_Monitor_Record.Mark_Proc(
          Target_Name    => 'Student-Sgbstdn',
          Procedure_Name => 'BStudent_Maint');
      end if;
    end Sgbstdn_Full;

Figure 4: Recording run from a PL/SQL procedure.


Special Generate_File Target 


We still have tasks that we are interested in watching
that are not written in PL/SQL or using Generate_File.
These might be older file generation programs, or just
shell scripts run out of cron. The Generate_File program
has the ability to pass a parameter to the processing
routine. We combined this with a variant of the previous
routine to create a new Generate_File target that will
record anything, with a prefix of MANUAL-.


In Figure six, we have a simple shell script that is run
from cron to generate the /etc/printcap file for our
system. Assuming that the program exists and runs
successfully, we then call Generate_File with the target
Record_Process to record the completion of the task
MANUAL-Printcap.


    define name=GENERATE_NETWORK_LIST
    prompt Create Package &NAME
    Create or Replace Package &NAME as
      -- Generate web pages documenting our network.

      -- Define the standard interface
      procedure Get_Network_List(Fname out varchar2, Dbmsout out varchar2);
      procedure Get_Router_Html(result out varchar2, p1 in varchar2, p2 in varchar2);
      procedure Get_Subnet_Html(result out varchar2, p1 in varchar2, p2 in varchar2);

      -- (Package definition omitted)

    end &NAME;
    /
    begin
      Generate_File.Add_Target_Simple(
        target       => 'web_routers',
        filename     => 'primary_routers.html',
        get_data_rtn => '&name..Get_Router_Html');

      Generate_File.Add_Target_Complex(
        get_attr_rtn => '&NAME..Get_Network_List',
        get_data_rtn => '&NAME..Get_Subnet_Html');

      Generate_File.Add_Process_Record;
    end;
    /

Figure 5: Recording Run from a Generate_File target.


    #!/bin/sh
    #
    ## Script to Generate /etc/printcap

    FILE_GEN=/campus/rpi/simon/directory/2.0/@sys/bin/Generate_File
    PCAP_GEN=/campus/rpi/simon/printmaster/1.0/@sys/bin/etcprintcap

    #
    if [ -x "$PCAP_GEN" ]; then
        "$PCAP_GEN"
        if [ $? -ne 0 ]
        then
            echo "Error in printcap generation!!!!!"
            exit 1
        fi
        # Record the successful run as task MANUAL-Printcap.
        "$FILE_GEN" -target Record_Process -par2 Printcap
    fi

Figure 6: Recording run from a generic Generate_File target.


Notifications


Although it is all well and good for the database to
know when a process is overdue, we really need some way
of letting the appropriate people know about this. It is
important, however, that the mechanism used is
appropriate for the type of failure and the urgency of
the process. For example, when the process feeding
password changes into our Active Directory server fails,
we want to get the service restored within minutes. But
if the billing run for our backup service is a day late,
it isn’t a major problem; we normally run this two or
three times a year.


Most of the tasks we are monitoring run once or twice a
day. As a result, we are currently only checking for
“late” tasks a few times a day, generating a report, and
mailing it to interested parties. At present, we don’t
have anything in place to escalate a problem that has not
been repaired in a timely fashion. So far, the
notifications are unique and infrequent enough that they
are not ignored.


In Figure seven, we have a notification email from the
system. This is actually sent as an HTML page, instead of
a plain text message. While this might be annoying to
some, I already had the code to generate the late list as
a web page in the administrative tool, and I just had to
wrap it in a call for Generate_File. This has the added
advantage that the buttons visible on the email message
are fully functional, in that I can press one and see the
detailed information for that task. In this case, I have
three tasks that are late: printing and disk billing
(which was actually desired, as we were deferring some
revenue to the new fiscal year) and the Generate_File run
that produces our Building Directory web pages. In the
latter case, the normal run had failed due to the
administrative database being down for a backup.


Field             Type      Size  Description
Contact_List      varchar2  128   A list of email addresses to contact when problems are detected.
Error_Flag        varchar2  128   An optional error message set by the process. Results in immediate notification.
Notify_Delta      Number          The minimum time in seconds between notifications when an error has been detected.
Last_Notify_Time  Date            The time and date when notification was last attempted.
Next_Notify_Time  Date            The time and date when notification will next be attempted if a successful run has not occurred.

Table 4: Process_Monitor table (notifications).


[Screenshot: HTML email from root@rpi.edu to finkej@rpi.edu, subject
"Late List", dated Mon, 8 Jul 2002 12:15:13 -0400, listing three
overdue tasks (last runs 3 May 16:19, 03 Jun 11:25, and 04 Jul 03:27),
each with a Display button.]

Figure 7: Sample “late” message.


Conclusions


The Task Monitor tool has proven to be very useful
in detecting things that should have happened but
failed for some reason. This is especially useful for
the infrequent jobs that are easy to forget. Because it
is so easy to add monitoring to existing tasks, the
number of things that we watch has grown quickly.

Existing Limitations

Not all aspects of the original design have been
implemented. At present, we are only checking for
“late” processes twice a day. Before we can make this
a more frequent occurrence, we need to implement the
notification limits. While it may be useful to check for
late processes every two minutes, I don’t want to get
an email every two minutes when a monthly task is a
day late.


Another part we have not implemented is dealing
with non-regular, but recurring schedules. A number of
our business applications do not run on the weekends.
The intention to handle these would be to support cron
style schedules, and figure the next run time based on
the cron format schedule plus the run delta. This would
also make it easier to handle manual operations that we
expect to be done each business day. This still does not
handle holidays; this needs some more thought.
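
The two unimplemented pieces described above, notification limits and schedules that skip weekends, can be sketched in a few lines. The following Python is only an illustration and is not the Simon implementation (which is PL/SQL): the function names are hypothetical, only the Notify_Delta, Last_Notify_Time, and Next_Notify_Time columns come from Table 4, and skipping Saturday and Sunday is an assumed stand-in for real cron-format parsing.

```python
from datetime import datetime, timedelta


def should_notify(now, next_notify_time):
    """Return True when a late task is due for another notification.

    next_notify_time mirrors the Next_Notify_Time column; None means
    no notification has been sent yet for this problem.
    """
    return next_notify_time is None or now >= next_notify_time


def record_notification(now, notify_delta_seconds):
    """Return new (Last_Notify_Time, Next_Notify_Time) after sending a notice."""
    return now, now + timedelta(seconds=notify_delta_seconds)


def next_business_run(last_run, run_delta):
    """Next expected run: last run plus the run delta, pushed past weekends.

    Assumption: the task simply never runs on Saturday or Sunday.
    Holidays are not handled, as noted in the text.
    """
    due = last_run + run_delta
    while due.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        due += timedelta(days=1)
    return due
```

With a Notify_Delta of 86400 seconds, a checker that runs every two minutes would send at most one message per day for a given late task, which is the behavior the notification limits are meant to provide.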


Future Directions 


All of the tasks we are currently monitoring with 
this system are either directly accessing the database, 
or have the Generate_File program available. However, 
in order to help track activity on other (Unix) systems, 
a syslog or SNMP interface to allow other things to 
occasionally report in might be very useful. 


As the number of system-specific tasks, such as
the last run of CFEngine [1] on a machine, grows, some 
automatic classification and run delta assignment would 
remove the bottleneck of having to assign an owner, 
family, and schedule information for each new service. 
The ability to designate a particular task entry as a 
“prototype” for new tasks of the same name and differ- 
ent system identifier would make this very easy. 


Other notification methods need to be explored. 
The ability to generate syslog or SNMP messages and 
direct them to other monitoring tools could be very 
useful. This could in turn generate pages, or be 
directly incorporated into this system. 


Another extension to this project is as a basic
reminder system. This would just require a tool to
manually enter a new process, and directly set the 
Next_Run_Time value. This might be used to remind 
people to reset annual allocations or renew licenses 
and service contracts. 


We also have a number of tasks that are started
in response to some user request, such as a quota
change request. One approach would be to have the 
request also set the “next run” time for the quota 
change task. However, this approach might run into 
problems if people keep making new requests before 
the timeout is detected. A different approach is to be 
able to make periodic “empty” requests that will
require that the task finish all queued work. Both 
options need some consideration. 


Several of our file generation scripts are run out 
of cron. These are basically shell scripts that run the 
Generate_File program with different targets. Some- 
times one or more of these will fail, possibly due to a 
server being down, or PL/SQL packages that need to 
be recompiled. One possible approach would be to add 
a “rerun” flag to the shell script that would be passed 
to the Generate_File program. If set, another special 
target could be added that would have Generate_File 
skip the run if the target was not late. With this, if 
there were some problems, a person could just run the 
shell script with the “rerun” flag, and only those tar- 
gets that were late would get regenerated. 


References and Availability 


All source code for the Simon system is available
on the web. Please refer to the following URL for
details: http://www.rpi.edu/campus/rpi/simon/README.simon.


In addition, all of the Oracle table definitions as 
well as PL/SQL package source are available at
http://www.rpi.edu/campus/rpi/simon/misc/Tables/simon.Index.html.


Although this is implemented in Oracle as part of 
the Simon system, there is very little that requires 
Simon or even Oracle. Just about any relational 
database would be able to handle the moderate pro- 
cessing and database needs for this system. Given our 
starting point, most of our examples are deeply tied to 
Simon, but with alternate interfaces such as syslog and
SNMP, there is no reason why this could not be
deployed without Simon or Oracle. 
ABSTRACT


Experiments that analyze dependencies in RedHat Linux and RpmFind.net show disturbing 
conflicts and overlaps between software packages that result in installing multiple differing versions 
of dynamic libraries. The final state of a system containing conflicting packages depends upon the 
order in which packages are installed, as well as user input during the installation process. This leads 
to system states that may or may not have been tested, lowering confidence that the resulting 
software configuration will function properly. We describe the details of the problem, potential 
effects, and potential solutions involving improving the practice of building RPM packages. 


Introduction 


RedHat Package Manager (RPM) files and their 
equivalents have revolutionized the ease with which one 
can add software to a Linux system. But do RPMs 
embody ease, or perhaps danger? The following excerpt 
from Ladislav Bodnar’s article “Is RPM Doomed?” [2]
gives a very accurate account of a situation that most 
system administrators have experienced: 

“You have just found this great software on the 
Internet and off you go to download and install 
it. It’s all free and GPL and, as luck would
have it, the author provides a binary package 
in RPM format. It doesn't take long to down- 
load it, then you run the customary rpm -Uvh 
package-name.rpm command. OOPS! The instal- 
lation fails, reporting a missing dependent pack- 
age without which it will not install or function 
correctly. Off you go again to search the Inter- 
net for the missing library. 


Unfortunately, installing that missing library 
fails because of three other missing libraries 
and two other libraries that come in incorrect 
versions. Depending on how badly you want 
the original package, you have two choices — 
either go and search for all missing dependent 
libraries as well as all libraries dependent on 
the dependent libraries, or you just give up. 
How many times have you given up? 


If you are persistent and lucky, you might even- 
tually install the RPM package. If you are per- 
sistent and not lucky, then you have probably 
acquired a few bumps from banging your head 
against the nearest wall in sheer desperation. 
RPM dependency hell can be a hugely frustrat- 
ing experience — anything from circular depen- 
dencies (the catch 22 situation) to incorrect 
library version when there wouldn’t be much
left untouched had you really persisted in get- 
ting that badly wanted RPM installed.” 


The root of several problems with RPM (and
many other kinds of package management) is that
“order matters” [11]. It is common practice among
many administrators to install and uninstall RPMs and
other kinds of software packages with little concern 
for change control and without keeping a journal of 
the order of modifications. But case studies and theo- 
retical analyses [10, 11] suggest that the only way to 
produce a predictable and reliable system is to decide 
upon some particular order for package installations 
and other configuration actions, and always perform 
the actions in that order. Another independent analysis 
[5] suggests that order matters whenever system con- 
figuration actions do not take a rather restrictive form 
in which all configuration actions are “homogeneous” 
with one another; this means that if two actions 
change the same file, they change it to have the exact 
same content. While this would seem a reasonable 
requirement, in our experience, RPM installations do 
not satisfy this restriction. 


How dangerous is it in practice to ignore this dis- 
cipline of ordering? It seems from practical experi- 
ence that the danger is far greater than most of us real- 
ize. Many of us have managed to put systems into a 
state where “only starting over is feasible.” Why is
this so? 


This study looks at the risks associated with 
installing and uninstalling RPMs. We look at the 
nature of dependencies between packages to under- 
stand how one package has the potential to break 
another. We explain how use of poorly structured 
RPMs causes a “validation drift” in which the final
system gradually “drifts” over time to a configuration 
that has not been tested. Using global analysis of 
existing RPM repositories, we identify subtle inconsis- 
tencies in well-known RPM packages that can lead us 
to doubt the results of installing them. 


First we must comment that this study is limited 
in several ways. We only study the i386 distributions 
of RedHat 6.2, RedHat 7.2, and the Contrib directory, 
as listed on RedHat.com and RpmFind.net. These 
directories contain highly volatile data that will be 
quickly outdated. Much of the work was done in April 
of 2002 and repeated in July of 2002, with differing 
results due to changes (mostly improvements) in 
repository files! We are happy to report that one major
example of repository rot that was discovered in April 
could not be reproduced from July data. We hope that 
none of the inconsistencies we report will ever be 
reported again, because implementors and repository 
managers will be motivated to address the problems 
we have found. Though there may be different incon- 
sistencies, do not expect to necessarily find any of 
these particular problems in future RPM repositories. 


Why RPM? 


There are many package managers available to 
Linux developers and choosing one to focus on was 
difficult. Package managers such as Debian’s DEB 
[13] or Slackware’s TGZ [15] would have been fine 
choices for our analysis. However, the nature of RPM 
makes it a good basis to analyze the nature of installa- 
tion failures and validation drift while also providing a 
framework for thinking about solutions to these prob- 
lems. RPM is deployed in large and small network 
installations and many system administrators depend 
on it for installing and maintaining complex sets of 
software. It has a rich feature set that allows the instal- 
lation and removal of individual packages or sets of 
packages and it maintains an internal database that 
records and verifies each change to the system. 


Like most other package managers, RPM allows a
user to install packages on a computer, checking an
internal database to verify that all known prerequisites
are met before allowing the package to install. After
configuration, RPM allows the package to run arbitrary
binary and script files before and upon completion of
installation and uninstallation [1]. Because of this, we
concentrate upon the Red Hat package management
system [1] for Red Hat 6.2 and 7.2, including the
contributed packages available from
http://www.rpmfind.net.


Many readers have commented that we already 
know that the contrib directory is broken, so why ana- 
lyze it? We respond that it helps to know how things 
break, so that we can avoid them in the future. We 
knew when we started that there would be serious 
problems, but had no idea of the nature of the prob- 
lems we would find. And, surprisingly, the problems 
we found are not the problems that we expected! 


RPM Dependencies 


In every RPM package there exist several different 
kinds of dependencies. Declared dependencies external 
to the file are contained in the header information in 
each RPM package. Each package declares which ser- 
vices it “provides” and “requires.” A service is nothing 
more than a string. RPM satisfies dependencies by forc- 
ing one to sequence software installations so that ser- 
vices are “provided” by installed packages before they 
are “required” by others. 
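
The sequencing rule above amounts to a topological ordering over "provides" and "requires" strings. The sketch below illustrates that idea only and is not how RPM itself is implemented; the dictionary layout and the function name install_order are assumptions made for the example, and it relies on Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def install_order(packages):
    """Order packages so every required service is provided first.

    packages maps a package name to {"provides": set, "requires": set},
    where each service is just a string, as in an RPM header.
    Raises LookupError for a service nobody provides; graphlib raises
    CycleError for circular dependencies.
    """
    # Map each service string to the first package that provides it.
    providers = {}
    for name, pkg in packages.items():
        for service in pkg["provides"]:
            providers.setdefault(service, name)

    # Build a graph of package -> packages that must be installed before it.
    graph = {}
    for name, pkg in packages.items():
        predecessors = set()
        for service in pkg["requires"]:
            if service not in providers:
                raise LookupError(f"{name} requires unprovided service {service!r}")
            if providers[service] != name:
                predecessors.add(providers[service])
        graph[name] = predecessors

    return list(TopologicalSorter(graph).static_order())
```

Because services are opaque strings, nothing in this ordering checks that the provider actually installs the file the requiring package needs; that gap is what the rest of this section examines.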


But also, each executable file in an RPM archive 
has requirements, some of which can be determined 
through use of techniques like those of sowhat [4].
These internal dependencies are intrinsic to the file
and may or may not be related to dependencies
declared in the RPM header. Sowhat’s analysis utilizes
the output of ldd and is not exhaustive; one can subvert
it, e.g., by using dlopen to open a dynamic library by
name, bypassing ldd and ld.so.conf.

Some kinds of errors are relatively insignificant.
If an RPM package is over-declared (the set of
dependencies in the header exceeds the set that the
program needs), the consequence is that extra, possibly
useless, programs are installed. If a package is
under-declared, the packages actually used and required
are greater than what is declared. This means that a
package installation can fail even if the declared
dependency requirements are fulfilled.
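
The two cases reduce to set differences between declared and actual dependencies. A minimal sketch follows; the function name and return layout are illustrative assumptions, not part of RPM or of the tools described in this paper.

```python
def compare_dependencies(declared, actual):
    """Compare declared (header) dependencies with those actually needed.

    Returns the over-declared services (declared but unused) and the
    under-declared ones (used but never declared), each as a sorted list.
    """
    declared, actual = set(declared), set(actual)
    return {
        "over_declared": sorted(declared - actual),
        "under_declared": sorted(actual - declared),
    }
```

For a package that declares only libX11.so.6 but also loads libXi.so.6, the under-declared list would contain libXi.so.6 and the over-declared list would be empty.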


Are RPM Dependencies Sloppy? 


The root of all evil in RPM seemed to be — at the 
outset — the way RPM packages are expected to 
declare what they need in order to operate. In RPM, 
there is a simple mechanism for notating dependencies 
between packages. In the package header, each pack- 
age is declared to “provide” zero or more services. 
These are just strings with no real semantic meaning. 
A package that needs a service then “requires” it.
This mechanism is mainly used to declare dependen- 
cies between packages using a dynamic library and the 
package(s) that might provide a copy of the library, so 
the “services” are typically library base-names. 


Those of us who have experienced “dependency
hell” (as documented in the excerpt in the introduc- 
tion) have suspected major problems with the RedHat 
dependency system. We attempted to validate our sus- 
picion by running a simple test to check the difference 
between actual and declared dependencies. The results 
were both more encouraging than expected and, in a 
way, depressing. The true dependency errors seem to 
be few in number and cannot account for the trouble 
many people report in using RPMs. 


Checking Dynamic Library Dependencies 


We could not in general determine all dependencies
for a package, but we can determine all the
dynamic libraries needed by a package. We did this by
unpacking each package (using cpio) and skipping
execution of the installation scripts. Then we ran ldd
on each executable or dynamic library in the package.
Finally, we compared the actual dependencies exposed
by ldd with the declared dependencies in the header.
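
To make the ldd step concrete, the sketch below parses the usual text output of ldd into a mapping from library name to resolved path. It is an assumed illustration, not the authors' rift program: the function name is hypothetical, and it handles only the common line shapes ("name => path (address)", "name => not found", and a bare absolute path for the dynamic loader).

```python
def parse_ldd(output):
    """Map each library named in ldd output to its resolved path.

    Libraries that ldd could not resolve map to None. Lines that match
    none of the expected shapes are ignored.
    """
    needed = {}
    for raw in output.splitlines():
        line = raw.strip()
        if "=>" in line:
            name, _, target = line.partition("=>")
            target = target.strip()
            if target.startswith("(") or target.startswith("not found"):
                path = None  # no file behind this entry, or unresolved
            else:
                path = target.split(" (", 1)[0]  # drop the load address
            needed[name.strip()] = path
        elif line.startswith("/"):
            # The dynamic loader is printed as a bare absolute path.
            path = line.split(" (", 1)[0]
            needed[path.rsplit("/", 1)[-1]] = path
    return needed
```

The keys of the result are the names one would compare against the "requires" strings in the package header.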


Comparing these was a complex process due to 
the free-form nature of dependencies. It was a multi- 
step process that takes into account all the ways a 
dependency can be declared in a collection of RPMs. 
The cases for each target RPM (the RPM from the col- 
lection currently under investigation) include: 

a) The required library is part of the target RPM
that requires it. In this case, no dependency
listing is required.

b) The required library is explicitly required by
the target RPM and provided by another. This is
normal.

c) The required library is part of an RPM that
provides another service that happens to be
required by the target RPM. This is an implicit
dependency based upon a service tag that is
actually unrelated to the real library dependency
(Figure 1). This is usually bad style for
dependency declarations, at least for dynamic
libraries. The only exception is that to save
space, some implementors use blanket tags for
service subsystems, e.g., “require qt”. This is
to avoid listing all the core libraries of the
service explicitly.

d) The required library is part of the RedHat core
distribution, for which dependencies are not
explicitly listed, as their presence in any
RedHat system is assured.

e) The required library is in another RPM with
which the target RPM shares no explicit or
implicit dependency. This is a dependency error.


To analyze the distribution of these kinds of 
dependencies within RPM repositories, we wrote two 
programs. Rift lists all of the dependencies in a set of
packages that can be exposed through ldd. Tree reads
all RPM files in an RPM distribution and outputs all 
dependency and checksum information from the dis- 
tribution. The output of tree, together with the output 
of rift, is fed to a new program deps that categorizes 
dynamic library dependencies into each of the five 
classes above. 
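
The five cases can be expressed as a short decision procedure. The sketch below is an assumed reconstruction for illustration and is not the authors' deps program: the data layout (sets named "requires", "provides", and "contains" per package, plus a set of core library names) and the class labels are hypothetical, and the cases are tested in the order a) through e).

```python
def classify(target, lib, packages, core_libs):
    """Classify how package `target` obtains the dynamic library `lib`.

    packages maps a package name to a dict of three sets:
    "requires", "provides" (service strings) and "contains" (library names).
    """
    pkg = packages[target]
    others = [p for name, p in packages.items() if name != target]

    # a) The library ships inside the package that needs it.
    if lib in pkg["contains"]:
        return "internal"
    # b) Explicitly required here and provided by another package.
    if lib in pkg["requires"] and any(lib in p["provides"] for p in others):
        return "normal"
    # c) Delivered by a package that the target requires for another service.
    for p in others:
        if lib in p["contains"] and p["provides"] & pkg["requires"]:
            return "implicit"
    # d) Part of the core distribution, so never listed explicitly.
    if lib in core_libs:
        return "core"
    # e) No explicit or implicit relationship at all.
    return "error"
```

Applied to the packages in Figure 1, libX11.so.6 falls into the normal class for DigitalDJ, while libXi.so.6 is classified as implicit.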


DigitalDJ-0.5-1.i386
    requires libX11.so.6
    contains /usr/bin/ddj
        needs libX11.so.6
        needs libXi.so.6

XFree86-libs-3.3.2p1 2-1.i386
    provides libX11.so.6
    contains /usr/lib/libX11.so.6 (declared)
    contains /usr/lib/libXi.so.6 (undeclared)

Figure 1: Implicit dependency of ddj upon libXi.so.6
via libX11.so.6.


Our results for the RedHat i386 distribution are
shown in Table 1. The good news is that the
distribution itself, as it comes from RedHat, is rather
well-constructed with few errors. The errors seem to be
statistical outliers. Errors we found were isolated to
two packages. The anaconda runtime package
anaconda-runtime-7.2-7.i386.rpm fails to require
ld-linux.so.2, libc.so.6, libresolv.so.2, and libz.so.1.
These are part of the core distribution, so this has no
observable behavioral effect; it is just a bit sloppy.
PyQt-2.4-1.i386.rpm fails to require
libstdc++-libc6.1-1.so.2; this is a bit more serious and
requires user intervention.

Our results for the RedHat contrib directory (i386)
are shown in Table 2. Here things become more
“interesting.” Most packages are astoundingly
well-behaved about declaring their needs. Outright
errors are almost a statistical outlier. Errors we
observed are listed in Table 3. These are annoying but
minor at best.


Count | Kind of Dependencies
741   | normal: “requires” and “provides” correct.
26    | internal: package contains library upon which it depends.
5     | errors: dependency declaration omitted.

Table 1: Dependency types in RedHat 7.2.


Count | Kind of Dependencies
1201  | normal: “requires” and “provides” correct.
21    | internal: package contains library upon which it depends.
9     | implicit: unrelated dependency includes file.
8     | errors: dependency declaration omitted.

Table 2: Dependency types in RedHat Contrib.


Eterm-0.8.8-1.i386.rpm            | libungif.so.4
Frodo-4.1a-1.i386.rpm             |
ImageMagick-4.2.7-1.i386.rpm      | libbz2.so.0
ImageMagick-perl-4.2.7-1.i386.rpm | libbz2.so.0
Qtabman-0.1-1.i386.rpm            | libclntsh.so.1.0
XITE-3.3-3.i386.rpm               | libjpeg.so.62
XITE-3.3-3.i386.rpm               |
aktion-0.2.1-1.i386.rpm           | libstdc++-libc6.1-1.so.2

Table 3: Dependency errors in RedHat i386 Contrib.


One disturbing tendency was some use of implicit
loading of libraries. There were nine instances in which
a dynamic library was required implicitly as a side
effect of another explicit requirement. The X11 library
libXi.so.6 was implicitly required six times as a result of
explicitly requiring libX11.so.6. Likewise, libdl.so.2 was
implicitly required three times as a result of explicitly
requiring libc.so.6. These libraries were also among
those that were formerly bundled with the libraries
whose dependencies load them. The implicit loads of
these libraries are probably due to packages being
designed before that library design change took place.


The prognosis of this work is surprisingly good.
While it would seem that the contributed RPM
repository would be chaos, our simple checks showed
contributed RPMs to be relatively organized and
well-structured. Our obvious question, then, is “what is
really wrong?” It is not the dependencies, because
errors in these are statistically insignificant. We must
look deeper for potential problems within RPMs. The
key to this looking deeper is the concept of validation
of the resulting system.


Validation 


The central theme of this paper is not depen- 
dency analysis itself, but rather the relationship 
between dependency analysis and validation of soft- 
ware installation. In software engineering [9], there 
are two forms of sanity checking: 

• “verification”: “are we making the product
right?” Does it conform to our own ideas of
how it should work?

• “validation”: “are we making the right
product?” Does it conform to customer needs?


In system administration, we often concentrate 
upon validation to the exclusion of any concept of veri- 
fication. We have less design freedom than software 
authors, and user requirements are usually more com- 
pletely spelled out than for a software engineer, so that 
there is much less difference between “verification” 
and “validation” than there would be in a software 
development environment. The key to validation is rig- 
orous testing [12] in a realistic setting. One must actu- 
ally try the system and see if things work as expected. 


Last year, Couch and Sun described global
analysis [4] without emphasizing validation, its most
important component. Any analysis of what went
wrong with a system must be tempered by knowledge
that at some time in the past, “things were right.”
One’s reasoning must always flow from knowledge
that the system did work properly before. Otherwise,
the question of “what broke,” central to use of
sowhat, has no meaning. Unfortunately, it is often the
case that the user thinks “things were right” in their
system configuration when in fact they have been
working with a system that was not completely
validated. This perception by the user can contribute to
the problem at hand. Dependency problems are always
problems of validation: “can we be sure that in
making a change, we do not break anything?” Given a
rather strongly validated core distribution of RPMs,
how can we avoid breaking anything in it that worked
before?


Validation Rot 


Brooks [3] points out that in software engineering,
“software rot” occurs when too many small
changes are made to a complex system, so that no one 
really understands the function of the software. Couch 
points out that a similar kind of “filesystem rot” 
occurs in software repositories managed over long 
time periods [6], so that meanings of specific files 
become unclear and inaccessible. 


Our analyses show that RedHat machines managed
via RPM suffer from a new kind of rot: “validation
rot.” This is a gradual divergence from a fully
tested configuration that invalidates and undermines
prior testing. Here is how it works:

• We start with a “baseline configuration,” e.g., a
RedHat distribution. This configuration is present
on a multitude of hosts around the Internet,
so we can wait until this baseline has been
comprehensively validated by an extensive
community of users.

• Gradually, over time, we add functionality in
the form of “contributed RPMs.” Each one of
these adds some files and may replace others.
The result is that the system diverges not only
from the baseline, but also into a fairly unique
state that may not be replicated anywhere else
on the Internet.

• At any point in this process, one has a unique
system that has never been validated by anyone.


The user community (and vendors) validate
packages in relation to the baseline, not in relation to
other packages. It is nearly impossible to test, let
alone reach, all possible package states due to
combinatorial explosion. This means that a user is “at
risk” when their system reaches a configuration state
that has not been achieved by any previous population
of users or the software developer. It is possible, then,
that there are latent bugs that will show up only on
that particular machine.
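
A little arithmetic shows how fast the number of states grows. With n optional packages there are 2^n possible sets of installed packages, and if installation order matters, the number of distinct installation histories is the sum over k of n!/(n-k)!. The sketch below computes both; the function names are illustrative, and treating every package as independently installable is a simplifying assumption.

```python
from math import perm


def package_subsets(n):
    """Number of distinct sets of installed packages drawn from n optional ones."""
    return 2 ** n


def install_sequences(n):
    """Number of ordered installation histories using any k of n packages."""
    return sum(perm(n, k) for k in range(n + 1))
```

Even ten contributed packages give 1024 possible package sets, far more than any single vendor or user community is likely to have tested against one baseline.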


Validation Expense 


Validating software is expensive. For commercial
software, it often requires the expertise of a Software
Quality Assurance (SQA) [9] team. For open-source
freeware, the user community itself often
serves that purpose over longer time periods,
submitting bug reports and fixes. Either way, software
can only be validated as working properly by extensive
testing in multiple environments and with various
kinds of inputs. There is no such thing as “completely
tested software” [10] and one must always decide
what form of testing is “good enough” or “complete
enough” [12].

Transitive Validation 


The expense of validation has led the Linux
Standard Base [14] to employ a “transitive validation”
strategy. Previously, a vendor wanting to market a
software package for Linux had to validate its function
on every distribution of Linux. The Linux Standard Base
was created in order to give vendors the assurance that
a package that works somewhere works everywhere. If
we control the couplings between a software package
and its operating environment, and can validate the
environment as possessing appropriate couplings, then
validating it in one compliant environment validates it
in all such environments.


There are two parts to the Linux Standard Base:

1) a code validator that indicates whether a
specific binary file is compliant with the base. This
checks whether the system calls used by the
binary file are loaded from correct versions of
dynamic libraries.

2) an environment validator that checks whether
the environment on a specific Linux machine
complies with the minimal requirements needed
for system calls to work properly. This checks
not only the existence and versions of key
libraries, but also checks that particular system
control files are found in standardized
locations, e.g., /etc/hosts.


The key assertion chain of LSB is that:

a) If a particular vendor software package passes
code validation, i.e., only utilizes approved
system and library calls, and

b) The vendor package has been tested on one
LSB-compliant system, and

c) A particular Linux system passes environmental
validation, i.e., has all its libraries and files in
appropriate places, then

d) The vendor software should function fine on
any environmentally compliant system.


This is a “transitive validation” claim: software
that works in one compliant environment works in
every such environment. This can potentially save
tons of money in validating software for different
distributions of Linux.


Is Validation Trust Transitive? 


We are inspired by the LSB strategy and would
like to apply a similar process to the problem of
validating RPM-managed systems. The key question is
“What can we trust?” Trying to answer this question
cuts to the heart of the RPM problem. We contend that
“we usually put too much trust in existing
infrastructure.”


For example, let’s consider the following apparent
assertions about “transitive trust”:

1) If a program works properly on a RedHat 7.2 
system, it will work properly on all RedHat 7.2 
systems. 

2) If a script works properly for a particular ver- 
sion of Perl, then it will work properly in the 
same version of Perl regardless of the environ- 
ment in which it executes. 

3) If a program or dynamic library compiled with 
one compiler works properly, then it will work 
properly if compiled with another compiler. 

4) If a program works before new software is 
installed, it will work after the software is 
installed. 


In each case, we decide to trust something in a 
new situation based upon validation in an old situa- 
tion. All of the above assertions seem reasonable, and 
all are quite obviously false to the point of being ludi- 
crous. Each of these points can be expressed in realis- 
tic terms as follows: 

1) Just what is a RedHat 7.2 system? This is the
baseline, but what has been done to the system
since then? If a program depends, e.g., upon
/etc/foo, then it will only work if one has
installed /etc/foo. This has nothing to do with
the baseline.

2) Any Perl programmer knows that Perl does 
some rather strange things to cope with system 
differences. For example, its implementation of 
lockf can take at least four forms depending 
upon support for locking in the operating sys- 
tem. These forms are semantically different. 

3) Modern compilers have bugs, especially when 
optimization is turned on. Validation under one 
compiler is no guarantee of function when the 
same program is compiled with another. 

4) Even in the simplest of cases, it is easy to break 
a program by installing another. The problem is 
“hidden dependencies” between programs and 
other programs and libraries. 


These simple examples are obviously bogus, but 
administrators who install RPMs on an ad-hoc basis 
are using them as assumptions. We come to the inex- 
orable conclusion that a system is validated as func- 
tional if: 

1) it is constructed starting from a validated base- 
line, 

2) all software packages installed in addition to 
the baseline are: 

a) validated against the baseline configura- 
tion by being installed against it and thor- 
oughly tested. 

b) homogeneous [5], in the sense that over- 
laps between packages other than the base- 
line install the exact same content. 

c) uncoupled from the contents of other pack- 
ages (excluding homogeneous overlaps), 
so that software within each package only 
refers to baseline content and the content 
of the specific package. 


Analysis of RPM Failures 


There are four main things that can interfere with 
the proper operation of a single package: 

1) Hidden dependencies not known to the package 
designer. 

2) Version skew between files and the programs 
that utilize them. 

3) Relationships between files that are obscured 
by scripting. 

4) Asynchronous operations other than package 
management that affect package files or 
required files. 


Hidden dependencies are those unknown to the 
package designer or simply undeclared. Every pack- 
age implicitly depends, e.g., upon the whole base dis- 
tribution. If something in the base distribution 
changes, the package may break, but such dependen- 
cies are never made explicit. According to the results 
above, though, these may be statistically insignificant. 
Version skew is a very common problem with
which many people are familiar in both the Unix and
Windows worlds. This happens when a library or program
associated with more than one program is
upgraded and the newer version is not functionally
compatible with the older version.


Installation scripts, which are commonly 
included with programs today, are usually designed to 
move files and create directories that are custom to the 
package. But with larger, more complex packages, 
installation scripts are non-trivial and can perform 
tasks that have system-wide effects. Since the changes
that an installation script can make are limited only by
the rights of the user running it (typically root), any
package has the ability to touch any other program or
file. A common example is the installation of an
Apache RPM, which modifies the inetd.conf file in a way
that is not obvious unless one already knows that
modifications are being made.


Asynchronous operations, which include 
installing non-RPMs, manually changing files that 
rpm controls, hacking, etc., can also have a com- 
pelling effect on the validity of installations. Since 
most complex systems span multiple volumes, when a 
package is located on one volume and a dependent 
package is located on a different volume, both vol- 
umes must be mounted or the dependent package may 
not work. This type of dependency is difficult to prop- 
erly diagnose. 


Global Analysis 


The problem with the above descriptions of fail- 
ure modes is that they are all module-centric. They 
can explain what’s happening when one module is 
installed, but do not depict potentially subtle multi- 
module interactions. We applied the global analysis 
techniques of sowhat [4] to this problem and found 
potential and subtle failures in adding RPMs to the 
standard distributions. While the configuration lan- 
guage for RPM files allows expression of “backward” 
dependencies between referrer and required resource, 
“forward” dependencies between a new resource and 
an old program that uses it can lead to systems whose 
function has not been properly tested. 


Our Experiments 


We undertook several experiments to understand 
the scope of the problem of validation drift. Our first 
task was to identify the scope of inhomogeneity within 
RPM packages. We obtained several package distributions
(for the i386 architecture) using the sites
redhat.com and rpmfind.net. We then wrote several custom
scripts to analyze their contents and pinpoint potential 
validation problems. 


Inhomogeneities 


Because of the computational difficulty of read- 
ing large distributions, we broke our analysis into sev- 
eral short scripts, each of which provides input to the 
next. Our first Perl script, tree, reads all RPM files in an
RPM distribution and outputs all dependency and 
checksum information from the distribution as a text 
file. This data is then read by a second script, munch, 
which computes a list of files that are inhomoge- 
neously provided by more than one package, as indi- 
cated by differing MD5 signatures for the exact same 
file path. We then went over these inhomogeneities 
using a filtering script punch to eliminate conflicts 
based upon hardware differences (i386 vs. i686) 
where needed. This was done based upon the naming 
conventions for RPM files. 
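

The munch and punch steps described above can be sketched in a few lines. This is not the authors' Perl code: it assumes the first stage has already produced (package file, path, MD5) records, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def find_inhomogeneities(records):
    """Return {path: {md5: [packages]}} for paths claimed with differing MD5s.

    `records` is an iterable of (package_file, path, md5) tuples, a
    hypothetical stand-in for the text file written by the first script.
    """
    by_path = defaultdict(lambda: defaultdict(list))
    for package, path, md5 in records:
        by_path[path][md5].append(package)
    # A path is inhomogeneous when more than one distinct checksum claims it.
    return {p: dict(v) for p, v in by_path.items() if len(v) > 1}


def rpm_arch(package_file):
    """Architecture field, assuming the name-version-release.arch.rpm convention."""
    return package_file.rsplit(".", 2)[-2]


def drop_arch_conflicts(conflicts):
    """Keep only paths where one architecture sees two different checksums."""
    kept = {}
    for path, versions in conflicts.items():
        md5s_by_arch = defaultdict(set)
        for md5, packages in versions.items():
            for package in packages:
                md5s_by_arch[rpm_arch(package)].add(md5)
        if any(len(md5s) > 1 for md5s in md5s_by_arch.values()):
            kept[path] = versions
    return kept


# Hypothetical sample records, for illustration only.
records = [
    ("a-1.0-1.i386.rpm", "/usr/lib/libx.so", "aaa"),
    ("b-1.0-1.i386.rpm", "/usr/lib/libx.so", "bbb"),
    ("c-1.0-1.i386.rpm", "/usr/bin/tool", "ccc"),
    ("c-1.0-1.i686.rpm", "/usr/bin/tool", "ddd"),
    ("d-1.0-1.i386.rpm", "/etc/shared", "eee"),
    ("e-1.0-1.i386.rpm", "/etc/shared", "eee"),
]
conflicts = drop_arch_conflicts(find_inhomogeneities(records))
```

In this sample, /etc/shared is a homogeneous overlap and is never reported, /usr/bin/tool differs only between architectures and is filtered out, and /usr/lib/libx.so remains as a genuine inhomogeneity.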


Results of this process differed greatly depending
upon which repository we analyzed. RedHat 7.2 (i386)
exposed no conflicts whatsoever. RedHat 6.2 (i386) 
exposed 14 conflicts, of which one was a difference 
between a software bundle and an individual package; 
three were due to packages that are mutually exclusive, 
e.g., mail delivery agents; six were due to multiple ver- 
sions of the same software, and the remaining four were 
due to unforeseen and simple packaging mistakes. Each 
“conflict” is a system file that has multiple versions 
listed in the RPM repository, where there may be up to 
five versions available for a single conflict. 


It should be no surprise, however, that performing
this process on the contrib directory exposed
3499 inhomogeneities, distributed as in Table 4.
The majority of the problems were due to the presence 
of multiple versions of the same software, sometimes 
with recognizable naming patterns, sometimes not. 


Kind of Error

differing versions of the same software
file version conflicts in apache modules
software packages are mutually exclusive by design
software bundle disagrees with individual tool package
inconsistent versions in overlapping software bundles
other conflict

Table 4: Inhomogeneities found in contrib directory of rpmfind.net.


Apache contributed modules were a source of 
great chaos; conflicts encompassed everything from 
HTML and GIFs to dynamic libraries as described in 
Table 5. One big surprise is that a single apache add- 
on, php, was responsible for 125 of the 238 file ver- 
sion conflicts for apache modules. Most of this was 
due to replacing — for no reason apparent to us — much 
of the HTML documentation for apache itself inside 
apache_php3-1.3b6-1.i386.rpm. This is a classic case of
“repository poisoning”: one RPM creates inconsistencies
that affect several others.


Another, potentially much more serious, problem
is that there were 36 dynamic libraries with multiple
versions, as listed in Table 6. These are the libraries
loaded by Apache httpd itself. This was due to the 
contents of only five modules as described in Table 7. 
Each of these modules contained copies of between 32 
and 36 dynamic libraries. 


HTML documentation (.html)
dynamic libraries (.so)
manual pages (.1-.8)
header files (.h)
executables
configuration files
other

Table 5: Kinds of file conflicts in Apache.


The remainder of the inhomogeneities were due 
to several kinds of problems. Several packages were 
mutually exclusive by design, e.g., mail delivery 
agents or service daemons for the same service. 52 
times, a software bundle disagreed with the package 
containing an individual tool added to the bundle. 25 
files were inconsistent among two or more different 
software bundles. Nine conflicts were simply unforeseen
couplings between files and modules, e.g.,
expect-5.31.2-2.rh6.1.i386.rpm surprisingly instantiates
/usr/bin/rftp along with socks-4.3.beta2-2.i386.rpm.


Binary Differences 


At a more detailed level, while executables and 
libraries that are exact binary copies of others are func- 
tionally identical, executables and libraries that exhibit 
binary differences may or may not be interchangeable. 
Binary differences are evidence that there may be a 
functional difference, but this functional difference may 
or may not exist when the programs are executed. 


Binary differences between files can also occur 
for completely gratuitous reasons. One can use two 
different compilers to compile the same .c file to get 
two object files that are different in binary but identi- 
cal in text and function. As indicated by the above 
analysis, validation of code is not invariant of choice 
of compiler. 


Our goal was to look at the grouping of RPMs 
available to us and determine which executables and 
libraries showing binary differences were possibly 
problematic and not caused by gratuitous metadata. 
The basic issue is whether two files that differ do so in 
a way that changes the behavior of programs. The files 
upon which we concentrated are all the Executable and
Linking Format (ELF) [8] files in a Linux system, including
executables and dynamic libraries (.so).


We compared two different versions of the same 
file through a C program (provided by our advisor 
Alva Couch) that compares the binary contents of the 
text, data, and bss segments of an ELF [8] file. If two 
ELF files — executables, libraries, etc. — do not differ 
in text, data, and bss segments, then they are 
functionally equivalent, even if they differ in other
metadata such as date, compiler version, etc. 
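

The comparison described above can be approximated as follows. This is a minimal sketch, not the C program used in the study. It assumes that the "text, data, and bss segments" correspond to the ELF sections named .text, .data, and .bss, it handles only little-endian files, and the function names are hypothetical. Since .bss occupies no space in the file, only its size is compared.

```python
import struct

SHT_NOBITS = 8  # section type with no bytes in the file (e.g. .bss)


def elf_sections(blob):
    """Map section name -> contents (or size, for NOBITS sections)."""
    if blob[:4] != b"\x7fELF" or blob[5] != 1:
        raise ValueError("not a little-endian ELF image")
    is64 = blob[4] == 2  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    if is64:
        (shoff,) = struct.unpack_from("<Q", blob, 0x28)
        shentsize, shnum, shstrndx = struct.unpack_from("<HHH", blob, 0x3A)
    else:
        (shoff,) = struct.unpack_from("<I", blob, 0x20)
        shentsize, shnum, shstrndx = struct.unpack_from("<HHH", blob, 0x2E)

    def header(index):
        # Only the first six fields of a section header are needed here.
        fmt = "<IIQQQQ" if is64 else "<IIIIII"
        name, kind, _flags, _addr, offset, size = struct.unpack_from(
            fmt, blob, shoff + index * shentsize)
        return name, kind, offset, size

    _, _, str_offset, str_size = header(shstrndx)
    strtab = blob[str_offset:str_offset + str_size]

    sections = {}
    for index in range(shnum):
        name, kind, offset, size = header(index)
        label = strtab[name:strtab.index(b"\0", name)].decode()
        sections[label] = size if kind == SHT_NOBITS else blob[offset:offset + size]
    return sections


def same_text_data_bss(blob_a, blob_b, names=(".text", ".data", ".bss")):
    """True when the named sections are byte-identical in both images."""
    a, b = elf_sections(blob_a), elf_sections(blob_b)
    return all(a.get(name) == b.get(name) for name in names)
```

Two files read with open(path, "rb").read() can be passed directly to same_text_data_bss. A result of False shows only that the sections differ, which, as noted above, may or may not imply a functional difference.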









/usr/lib/apache/mod_access.so 
/usr/lib/apache/mod_actions.so 
/usr/lib/apache/mod_alias.so 
/usr/lib/apache/mod_asis.so 
/usr/lib/apache/mod_auth.so 
/usr/lib/apache/mod_auth_anon.so 
/usr/lib/apache/mod_auth_db.so 
/usr/lib/apache/mod_autoindex.so 
/usr/lib/apache/mod_cern_meta.so 
/usr/lib/apache/mod_cgi.so 
/usr/lib/apache/mod_digest.so 
/usr/lib/apache/mod_dir.so 
/usr/lib/apache/mod_env.so 
/usr/lib/apache/mod_example.so 
/usr/lib/apache/mod_expires.so 
/usr/lib/apache/mod_headers.so 
/usr/lib/apache/mod_imap.so 
/usr/lib/apache/mod_include.so 
/usr/lib/apache/mod_info.so 
/usr/lib/apache/mod_mime.so 
/usr/lib/apache/mod_mime_magic.so 
/usr/lib/apache/mod_mmap_static.so 
/usr/lib/apache/mod_negotiation.so 
/usr/lib/apache/mod_rewrite.so 
/usr/lib/apache/mod_setenvif.so 
/usr/lib/apache/mod_speling.so 
/usr/lib/apache/mod_status.so 
/usr/lib/apache/mod_userdir.so 
/usr/lib/apache/mod_usertrack.so 
/usr/lib/apache/libproxy.so 
/usr/lib/apache/mod_log_agent.so 
/usr/lib/apache/mod_log_config.so 
/usr/lib/apache/mod_log_referer.so 
/usr/lib/apache/mod_unique_id.so 
/usr/lib/apache/mod_bandwidth.so 
/usr/lib/apache/mod_vhost_alias.so 
Table 6: Number of multiple versions of each Apache
dynamic library.


RPM file

apache-fp-1.3.3-1.i386.rpm
apache-mod_ssl-fp2000-1.3.12.2.6.2-0.6.0.i386.rpm
apache-php3perl-1.3.12-3nosyb.i386.rpm
apache-ssl-jserv-1.3.2-2.i386.rpm
apache_modperl-1.3.6-1.19-1.i386.rpm

Table 7: Apache RPMs asserting conflicting dynamic
libraries.


The result of this analysis was that of the 981
inhomogeneities in ELF format, only 13
turned out to be functionally equivalent, and 10 of
these equivalences arose from two differing revisions
of the same package. The remaining three were
unusual: it turns out that the two RPMs

• apache-mod_ssl-fp2000-1.3.12.2.6.2-0.6.0.i386.rpm
• apache-php3perl-1.3.12-3nosyb.i386.rpm

have exactly equivalent copies of mod_unique_id.so,
mod_log_referer.so, and mod_log_agent.so, for reasons
we do not understand.


In summary, lack of equivalence is the rule. 
There were 968 inhomogeneities that could not
be proven functionally equivalent because they differ
in the text, data, or bss segment.


Update Skew 


We know that RedHat validates the core distribu- 
tion and each update, and have verified that their core 
RPMs are homogeneous in RedHat 7.2. But our analy- 
ses of contributed RPMs show that it is easy to par- 
tially update a system so that updated files are version- 
skewed with respect to one another. 


In several cases, notably involving contributed 
versions of Apache, updates overlap in asserting new 
contents for particular libraries. RPM assures that backward
dependencies are satisfied (so that all libraries
used by executables in the update are updated), but does
not assure that forward dependencies are satisfied.
Any other pre-existing program that happens to use the 
same library is at risk of malfunctioning unless it is 
updated or validated against the new library as well. 
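

A check for unsatisfied forward dependencies can be sketched as below. The sketch assumes that a map from each installed program to the libraries it loads is already available (for example, gathered with ldd); the function name and all program and library names in the example are hypothetical.

```python
def forward_dependents(installed_uses, updated_files, updated_programs):
    """Programs outside the update that load a file the update replaces.

    installed_uses:   {program: iterable of library paths it loads}
    updated_files:    paths whose contents the update asserts
    updated_programs: programs shipped with the update itself
    """
    replaced = set(updated_files)
    shipped = set(updated_programs)
    at_risk = {}
    for program, libraries in installed_uses.items():
        if program in shipped:
            continue  # delivered together with the new library
        hit = replaced & set(libraries)
        if hit:
            at_risk[program] = sorted(hit)
    return at_risk


# Hypothetical system state, for illustration only.
uses = {
    "/usr/sbin/httpd": ["/usr/lib/libexample.so.1", "/lib/libc.so.6"],
    "/usr/bin/oldtool": ["/usr/lib/libexample.so.1"],
    "/usr/bin/other": ["/lib/libc.so.6"],
}
risk = forward_dependents(
    uses,
    updated_files=["/usr/lib/libexample.so.1"],
    updated_programs=["/usr/sbin/httpd"],
)
```

Here /usr/bin/oldtool is flagged: it was validated against the old library and will now run against the new one without having been updated or retested.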


In analyzing global dependencies in the RedHat 
distribution, updates, and contributed modules, we 
have observed two main kinds of failure of validation. 
Either an existing program is forced to use new 
libraries with unpredictable results, or the contents of 
a specific library depend upon installation order. 

Case Study: Updating Apache 

In Apache, the exact contents of a library depend 
upon the sequence of RPM installations. It is consid- 
ered good practice to encapsulate apache updates, 
extensions, and modules into individual RPM packages, 
where each package contains an Apache module and all 
files that the module might potentially need [1,7]. This 
means, however, that many files in the main Apache 
package are duplicated among the update packages. If 
duplicated files are identical across all packages, there 
are few potential problems, but if they differ, we have 
reason to suspect that system states are attainable that 
have not been tested by anyone. 


By a simple signature analysis, we found, e.g.,
that in the updates to RedHat 7.2, there are five different
versions of /usr/lib/apache/mod_autoindex.so
contained in five module RPMs:

• apache_modperl-1.3.6-1.19-1.i386.rpm
• apache-mod_ssl-fp2000-1.3.12.2.6.2-0.6.0.i386.rpm
• apache-ssl-jserv-1.3.2-2.i386.rpm
• apache-fp-1.3.3-1.i386.rpm
• apache-php3perl-1.3.12-3nosyb.i386.rpm

In principle, all copies of mod_autoindex.so should be
identical in function, but comparison of md5 signatures
shows that all these files differ in binary content. This
means that there are five different states for this file on
an updated system, depending upon which updated
module is installed last.





Figure 2: The unseen effects of updating apache: validation
rot. (Diagram axis label: Order of Installation.)





Figure 2 shows one possible configuration that 
could be used as an installation sequence for the five 
apache RPMs described above. The arrows from each
RPM to mod_autoindex.so represent that RPM's installation
of mod_autoindex.so. The problem here is that
each RPM file has its own version of mod_autoindex.so.
Thus, based on the order of installation and
user input (about replacing files), the RPM that is
installed last determines which version
of mod_autoindex.so will be used by all of the
packages once they are installed. This can leave the
system in an untested state and ultimately lead to
a system failure.
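

The effect of installation order can be illustrated with a small simulation. It assumes that every install is allowed to overwrite existing files, which corresponds to the user always agreeing to replace them. The package names are those listed above, but the version labels "v1" to "v5" are placeholders, not real MD5 signatures.

```python
from itertools import permutations

PATH = "/usr/lib/apache/mod_autoindex.so"

# Each RPM asserts its own contents for PATH (labels are placeholders).
PROVIDES = {
    "apache_modperl-1.3.6-1.19-1.i386.rpm": {PATH: "v1"},
    "apache-mod_ssl-fp2000-1.3.12.2.6.2-0.6.0.i386.rpm": {PATH: "v2"},
    "apache-ssl-jserv-1.3.2-2.i386.rpm": {PATH: "v3"},
    "apache-fp-1.3.3-1.i386.rpm": {PATH: "v4"},
    "apache-php3perl-1.3.12-3nosyb.i386.rpm": {PATH: "v5"},
}


def final_state(install_order, provides):
    """File contents after installing packages in order, overwriting on overlap."""
    files = {}
    for package in install_order:
        files.update(provides[package])
    return files


# Try every installation order and collect the resulting contents of PATH.
end_states = {
    final_state(order, PROVIDES)[PATH] for order in permutations(PROVIDES)
}
```

All 120 installation orders are legal, yet they collapse into five distinct end states for the shared file, one per package that can be installed last. The other four packages then run against a library that they did not ship.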


Now in an ideal world, all that might differ in the 
various copies of mod_autoindex.so is “circumstantial” 
and does not affect behavior, e.g., the time that the 
source file was compiled. We actually think this is the 
case here, but have no easy way to validate contents 
against source code. This configuration situation rep- 
resents a risk that one or more of these copies exhibits 
different behavior than the others. To be sure that a 
system works properly after updating, we need to 
know that the version that we have has been validated 
with the other modules that use this library. 


Lessons Learned 


Our preliminary analysis of the “contrib” branch
of the RedHat distribution indicates that there are
roughly 324 potential library version skew conflicts,
without even considering supporting executables and
scripts. We believe that an even more detailed analysis
will expose many more skew conflicts. These conflicts
overshadow the dependency errors in RPM declarations,
which are statistically insignificant by comparison.
While dependency errors may be statistically
insignificant, they can be annoying and can cause
major problems in some instances.


Many administrators suffer from the illusion that 
one can install and manage packages in a relatively 
ad-hoc way at low cost. This illusion is shattered by 
considering the implications and cost of testing of ad- 
hoc configurations with the same rigor with which 
core distributions are tested. Testing is very costly; we
consider it the vendors’ responsibility, and yet we put
our systems into states the vendor could not test and
validate. If we fail to test the system ourselves, then 
the user of the system will inadvertently test it for us. 


Using global analysis techniques, it is possible to 
predict whether one is moving to an untested configu- 
ration and to take corrective action so that one’s sys- 
tem remains one that is widely used and tested. It is 
possible to minimize divergences between one’s con- 
figurations and those used by the broader community, 
and to understand the cost of ad-hoc or divergent 
administrative methods. 


Changing Practices 


So what can we do differently to avoid these 
problems and assure that what we install will work 
properly? The key seems to be a different attitude and 
technique in creating and using RPMs. We can 
demand homogeneity in RPMs contributed by out- 
siders. We can analyze their dependencies and validate 
this homogeneity to some extent. We can minimize the
effects of scripting by re-architecting RPMs and systems
to have simpler script requirements, e.g., by creating
directories rather than files, i.e., a directory of
per-service files such as xinetd.d rather than a single
shared inetd.conf.


Avoiding gratuitous changes to validated baseline 
distributions will help ease the problems as well. We 
must look at changes more carefully before we make 
them. Each change not only represents a modification 
to our system but also pushes us farther from the base- 
line system. By constantly updating and changing our 
systems, we are moving farther and farther away from 
the baseline and a system that we know is validated. By
viewing system changes as both updates and jumps
from the baseline, we can make more informed decisions
before we gratuitously install packages.


We can also avoid conflicts over dynamic 
libraries by making crucial libraries specific to the 
packages they serve. If a package needs a special 
dynamic library, name it differently than the normal 
one to avoid conflicts. This will help alleviate the 
problem of a library getting updated from one package 
when another package relies on the old version of it. 
By coupling packages with the libraries that are crucial
to their proper functioning, we can remove a problem
that is difficult to recognize prior to system failures.
This strategy trades memory for robustness; one
must often make a library effectively unshared in
order to isolate it from interactions.


Avoiding update skews in large packages by
updating coupled executables and libraries simultaneously
will solve the problem of multiple
packages relying on a single library. When several
packages rely on an important library, an upgrade of
any one of them can create unpredictable behavior.
Even more problematic is the differing behavior we
observe depending on the order in which packages are
updated. Coupling update packages together
eliminates both problems.


But as system administrators, we need to be able 
to deal with these issues now without waiting for 
package development practices to change. We must 
fully understand the problems that our unique systems 
may face through global analysis. Understanding the 
causes of potential problems and recognizing warning 
signs when updating our systems will only help make 
our machines and systems more stable. Even without 
changing the way packages are developed, we can 
help to protect ourselves from the headaches of valida- 
tion drift by using proper practices. For example, 
installation order of packages must be carefully 
observed as it is up to the system administrator to 
detect problematic conditions and assure the packages 
are installed in the proper order. 


Many of the practices mentioned above trade 
space for validation. Nowadays, space in systems is 
cheap, while validation is expensive. By changing the 
practices we use in administering Linux systems, par- 
ticularly with RPMs, we can save ourselves time and 
effort. The cost of this is the added space that multiple 
copies of libraries and packages may take up. But 
when you weigh the benefits versus the drawbacks, it 
is clear that a change in practices will help everyone. 


Future Work 


While the global analysis techniques upon which 
we report are based upon the strategies in sowhat, we 
have yet to integrate these into the tool proper. While 
we have a good grasp of the problem, a truly practical 
methodology seems to require this integration. We 
expect to do this some time in the coming year. 


Further, we intend to look at the arbitrary script 
actions performed by RPMs before and after installa- 
tion. These scripts can cause numerous things in a sys- 
tem to change; things that the user probably does not 
know are being changed. By looking at these scripts and 
comparing them against one another, we will be able to 
get an even better handle on exactly what an RPM is 
doing when updating a system in a baseline state. 


But this work is just the tip of the iceberg. 
Homogeneity of packages is necessary, but not 
sufficient. A truly practical strategy would account for
all dependencies: not just those one can discover with
ldd, but also dependencies that can only be discovered
by tracing library references. We and Yizhan Sun are 
also working on wrapping library calls to trace per- 
haps non-conventional use of dynamic libraries (using 
dlopen), and even tracing — at a fine grain — actual file 
use in a live system. These measures will give us a 
better idea of the true dependencies in a running sys- 
tem that can be violated by poor practice. 


Open Questions and Controversies 


This work also brings up some rather important 
open questions for study by the whole community of 
RPM users. These are not questions that we feel that 
we can address ourselves with the technology we 
have. We leave them to other researchers. 


One of the most heated controversies in current 
systems management is whether binary equivalence is 
necessary for behavioral equivalence of programs, 
libraries, or systems [10, 11]. It seems that the com- 
munity is strongly divided into factions of “theorists” 
and “practitioners.” Some “practitioners” believe 
that only identical binary files are guaranteed to 
behave identically (our premise) and that differing 
compilers, for example, cannot be trusted to compile 
the same source code with identical behavioral results. 
Some “theorists” believe, however, that “we should
be able to write compilers that perfect” and that the
source code should be the real measure of equivalence.
Some extremists also argue that even differing
source code can be proved behaviorally equivalent by 
compiler optimization techniques. We take the very
conservative position that “only practical techniques
can be applied now” and thus believe binary equivalence
is the only currently practical measure of behavioral
equivalence. Only time will tell whether the other
ideas of equivalence will be practical.


Another open question concerns the general 
nature of dependency. So far, we can only describe 
dependencies in a very coarse way, by saying which 
files should be present or which packages should be 
installed. Dependency, in general, is a much more 
complex thing. It creates limits on the contents of files 
as well as their presence and location. How far can we 
go with describing dependencies before the cure (of 
discovering and declaring dependencies) is worse than 
the disease (of dependency failure)? 


Conclusions 


We have shown that problems in RedHat installa- 
tions are not always caused by problems with depen- 
dencies between packages, but instead (and perhaps 
more commonly) by overlaps between packages. 
Dependencies declared inside a typical collection of 
RPMs are surprisingly accurate. But overlaps between 
package files seem to be a plague upon both closed and 
open repositories containing reusable binary RPMs. 
In casually installing RPMs in a Linux system, it
is easily possible to put the system into a state that no 
one has validated or tested. While for a “home com- 
puter” the risk of down time is fairly low, in an enter- 
prise management strategy, such ad-hoc system 
updates should be avoided in favor of staying near 
configurations that have been extensively tested and 
“burned in” by the community. We show that deviations
from tested states can sometimes be detected
before an RPM is installed by global analysis of all 
RPMs available. We also suggest that RPMs be con- 
structed so that any combination, in any order, always 
results in a validated system state. This is easy to 
accomplish by isolating dependencies and avoiding 
inhomogeneous overlaps, but seemingly only the 
major distributions have managed to do this properly. 
RTG: A Scalable SNMP Statistics
Architecture for Service Providers 


Robert Beverly – MIT Laboratory for Computer Science¹


ABSTRACT 


SNMP is the standard protocol used to manage IP networks. Service providers often analyze 
the utilization statistics available from SNMP-enabled devices to make informed engineering 
decisions, diagnose faults and perform billing. However, collecting and efficiently storing large
amounts of time-series data quickly, without impacting network or device performance, is 
challenging in very large installations. We identify three crucial requirements for an SNMP 
statistical solution: (i) support for hundreds of devices each with thousands of objects; (ii) the 
ability to retain the data indefinitely; and (iii) an abstract interface to the data. We then compare 
the applicability of several tools in a service provider environment. Finally, we detail Real Traffic 
Grabber (RTG), an application currently in use on our national IP backbone which we developed 
in lieu of existing packages to meet our requirements. 


Introduction 


Data traffic statistics are valuable in all networks, 
but are particularly crucial in service provider and 
enterprise environments. Not only is the data used to 
make informed engineering decisions such as traffic 
engineering, capacity planning and over-subscription 
analysis, it is also used for denial-of-service tracking, 
billing and policy purposes. The Simple Network 
Management Protocol (SNMP) [4] is the standard pro- 
tocol used for fault detection, diagnostics, device man- 
agement and statistics gathering in IP networks. While 
the protocol mechanisms themselves are “simple,” the
process of continually collecting, retaining, reporting 
and visualizing SNMP statistics data presents unique 
constraints in very large network installations. 


Worldcom is a large service provider with many 
disparate data networks and equally varied manage- 
ment systems. The particular Worldcom network we 
monitor is a national OC-48c (2.5 Gbps) backbone that 
has grown to approximately 110 devices each with an 
average of 100 interfaces. Scalability problems regu- 
larly plague service providers and other large networks 
scrambling to keep up with growth. The legacy sys- 
tems gathering SNMP interface utilization statistics 
from our network were no exception and faced severe 
performance problems to the extent that they were 
unusable. Simultaneously, new requirements to moni- 
tor additional per-interface statistics emerged along 
with a need to generate various custom reports. 


At a minimum, we needed a new system that 
could record bytes, packets and errors for every inter- 
face in the network with a five-minute granularity. The 
system must also produce long-term (multiple-year) 
trends and reports and keep detailed usage information 
for billing and legal purposes. We identified three high-level
requirements for our new system: the ability to (i)
scale the statistics infrastructure to support hundreds of
devices each with thousands of objects; (ii) retain the
data indefinitely; and (iii) provide an abstract interface
to the data. These requirements motivated the development
of Real Traffic Grabber (RTG).


¹The initial research was completed while with Worldcom.


2002 LISA XVI, November 3-8, 2002, Philadelphia, PA


RTG is a flexible, scalable, high-performance 
SNMP statistics monitoring system. All collected data 
is inserted into a relational database that provides a 
common interface for applications to generate com- 
plex queries and reports. RTG has many unique prop- 
erties including: it runs as a daemon incurring no cron 
or kernel startup overhead, it is written entirely in C 
for speed incurring no interpreter overhead, it is fully 
multi-threaded for asynchronous polling and database 
insertion, it performs no data averaging and it can poll at 
sub-one-minute intervals. RTG runs in production on 
several networks and has proved to be an invaluable tool. 


In this paper we first compare the applicability of 
various open-source tools and solutions in a service 
provider statistics environment. Next we detail the 
implementation and operation of RTG. We then pre- 
sent performance data measured for various monitor- 
ing platforms including RTG. Finally, we present 
graphs and reports unique to RTG. The paper con- 
cludes with availability information and suggestions 
for future development. 


Survey of Existing Monitoring Applications 


There are many open-source tools for gathering 
SNMP data; CAIDA² maintains an excellent list of
Internet measurement tools [3]. We experimented with 
several of the most popular applications and while 
they were ideal for many circumstances, none fulfilled 
our complete requirements. 


²Cooperative Association for Internet Data Analysis, http://www.caida.org


A widely popular open-source application for
visualizing link traffic is the Multi Router Traffic
Grapher (MRTG) [7, 8]. MRTG is a Perl script that
reads a configuration file and SNMP polls the listed 
devices. An external C program adds the result to an 
ASCII log file and then produces a series of traffic 
plots and corresponding HTML code. MRTG assumes
that as the data ages, the importance of detailed information
diminishes proportionately. Consequently,
MRTG implements a lossy storage mechanism
whereby multiple older samples are averaged into a
single data point representative of the entire time
period to ensure a fixed-size database.
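

The lossy storage idea can be illustrated as follows. This is a minimal sketch of averaging older samples into coarser data points. It does not reproduce MRTG's actual log format or consolidation schedule, and the function name and parameters are hypothetical.

```python
def consolidate(samples, keep_recent, bucket):
    """Average all but the newest `keep_recent` samples in groups of `bucket`.

    `samples` is ordered oldest first. The newest samples keep full
    resolution; each group of older samples collapses into its mean.
    """
    split = max(len(samples) - keep_recent, 0)
    old, recent = samples[:split], samples[split:]
    averaged = []
    for start in range(0, len(old), bucket):
        group = old[start:start + bucket]
        averaged.append(sum(group) / len(group))
    return averaged + recent


# Eight five-minute samples; the six oldest collapse into two averages.
history = consolidate([10, 20, 30, 40, 50, 60, 70, 80], keep_recent=2, bucket=3)
# history == [20.0, 50.0, 70, 80]
```

The individual values 10 through 60 cannot be recovered from the two averages, which is why a scheme of this kind conflicts with a requirement to retain every sample indefinitely.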


The primary advantages of MRTG are its ease of 
setup and use and friendly web output. While the visu- 
alization piece was excellent, MRTG was not suitable 
for our environment for several reasons. With such a 
large network, MRTG could not poll and process all of 
the objects in the network within a five-minute interval.
An MRTG process that did not complete on time
led to multiple MRTG processes piling up, as MRTG
is forcibly invoked via cron each sampling interval,
exacerbating the speed problem further. The MRTG
performance problems are due to its use of a flat
ASCII log file, sequential SNMP polling and insistence
on generating a new graphic image (either GIF
or PNG) for each object every five minutes. We examine
the performance characteristics of MRTG in detail
in a subsequent section. A second disadvantage with
MRTG lies in its use of a fixed size database which 
guarantees decreasing resolution with time and the 
eventual discard of old samples. Further, the ASCII 
log file is difficult for other applications to interface 
with or correlate to a particular set of customers easily. 


In response to the performance issues in MRTG,
the primary MRTG author created the Round Robin
Database tool (RRDtool) [6], which re-implements the
lossy storage technique found in MRTG in a pure
binary format for speed improvements. RRDtool also
offers much more flexibility than MRTG, allowing
multiple data sources per archive, varying time resolu-
tions and non-integer values. The graphing facilities are
similarly flexible, now allowing on-demand graphs for
arbitrary time periods and the ability to draw multiple
data sources simultaneously. The redesign, however,
does not address the data gathering (SNMP polling)
aspect of the performance problem and does not meet
our requirement to retain all data samples indefinitely.


Cricket [1] is designed to provide a manageable
interface to RRDtool; its hierarchical configuration tree
using inheritance works particularly well for large net-
works. Cricket is a set of Perl scripts that gather infor-
mation about the network topology, set up RRDtool
properly, and use the Perl SNMP module [5] to collect
statistics. The end result is a set of easy-to-navigate web
pages with RRDtool traffic plots similar to those pro-
vided by MRTG. We were impressed by the ease of
configuration and useful web output of Cricket. While
Cricket combined with RRDtool offers impressive flex-
ibility and speed, we desired a more generic interface to
the data and felt that there were more speed improve-
ments to be had. Cricket with RRDtool still incurs cron
and Perl interpreter overhead, polls devices sequentially,
and averages data samples.


Table 1 summarizes the advantages and disad-
vantages of the open-source tools we evaluated and
the primary use of each. We include RTG in this table
for comparison. While MRTG, RRDtool and Cricket
are appropriate for different environments, none met
our performance or collection criteria. Schemes that
perform long-term averaging can hide link peculiari-
ties important to engineering. Based on the availability
and relatively low cost of fixed disk storage, our design
constraint was to keep long-term data indefinitely
without averaging. We also recognized that it was
impossible to anticipate every user or application that
would need access to the data. Thus, we wanted as
abstract and open an interface to the data as possi-
ble. The creation of RTG was motivated by the lack of
suitable open-source tools and the inflexibility of
available commercial solutions.


Application: MRTG
  Advantages:    Ease of setup and maintenance, friendly web output,
                 large user base
  Disadvantages: Performance problems for large networks, lacks
                 flexibility and external interfaces
  Primary Use:   Small networks requiring only traffic plots

Application: Cricket with RRDtool
  Advantages:    Highly configurable, web output, high performance,
                 large user base
  Disadvantages: Averages samples, incurs Perl and cron overhead
  Primary Use:   Mid-to-large networks

Application: RTG
  Advantages:    Very-high performance with asynchronous threaded
                 polling, uses SQL database for applications to
                 generate complex queries and reports, supports
                 sub-one-minute polling intervals, runs as a daemon
  Disadvantages: Requires external packages (Net-SNMP and MySQL),
                 complex to configure, steep learning curve
  Primary Use:   Large-to-very large networks that require traffic
                 plots, advanced reports, no data averaging and
                 indefinite data storage

Table 1: Comparison of select SNMP statistics monitoring tools.


RTG Implementation


From our base requirements as a large service
provider and our experience with other SNMP statistics
packages, we developed RTG. RTG comprises rtg-
poll (a polling daemon), rtgplot (a plotting program),
rtgtargmkr.pl (a network configuration parser), a collec-
tion of Perl reporting scripts and a set of PHP scripts to
provide a web interface.


The RTG system centers around the polling dae-
mon, rtgpoll. To provide the highest performance possi-
ble, the poller is written in C and runs as a daemon, uti-
lizing less memory and fewer processor resources. Fur-
ther, to allow asynchronous parallel querying and pre-
vent any single query from blocking other polls, rtgpoll
is fully multi-threaded. A thread per SNMP query is
used such that the poller maintains a constant number
of “queries in flight,” greatly improving performance.


An often overlooked performance problem lies 
in the network devices themselves. The SNMP agent 
on many IP devices consumes significant resources in 
response to queries, particularly when many objects 
are polled. To equalize the query load and prevent 
device CPU starvation which may inadvertently cause 
routing problems or service instability, RTG random- 
izes the target list before polling. In this manner, even 
if the target file lists devices sequentially, rtgpoll 
SNMP queries individual object identifiers (OIDs) of 
the devices at random. Whereas our previous SNMP 
management software inflicted noticeable CPU spikes 
on the network devices, no spikes are evident with this 
scheme. An added benefit to this randomization strat-
egy is that should a device be physically down or
unreachable, the rtgpoll threads will not all block
waiting for queries to that device to time out. Thus, a single
device that is unreachable has little or no impact on
the overall RTG poll cycle time.
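
The combination of a shuffled target list and a fixed number of queries in flight can be sketched as follows. This is a minimal illustration in Python, not the rtgpoll source (which is written in C): the function names, the thread-pool structure and the `fake_query` stand-in for an SNMP request are all assumptions made for the example.

```python
import queue
import random
import threading


def poll_all(targets, query, num_threads=5, rng=None):
    """Poll every (device, oid) target using a fixed pool of worker threads.

    `query` is a caller-supplied function standing in for an SNMP GET.
    The target list is shuffled first so that consecutive queries are
    spread across devices instead of hitting one device back to back.
    """
    rng = rng or random.Random()
    shuffled = list(targets)
    rng.shuffle(shuffled)

    work = queue.Queue()
    for target in shuffled:
        work.put(target)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                device, oid = work.get_nowait()
            except queue.Empty:
                return  # no targets left for this poll cycle
            value = query(device, oid)  # may block; other workers keep going
            with lock:
                results[(device, oid)] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_query(device, oid):
    """Hypothetical stand-in for an SNMP query; returns a deterministic value."""
    return len(device) + len(oid)


# Three hypothetical routers with four interface counters each.
targets = [(f"rtr{r}", f"ifInOctets.{i}") for r in range(1, 4) for i in range(1, 5)]
results = poll_all(targets, fake_query, num_threads=5, rng=random.Random(0))
```

Because each worker pulls the next target only when its previous query returns, a query that stalls on an unreachable device ties up one worker at most, while the remaining workers continue through the shuffled list.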


The rtgpoll program reads a master configuration
file, rtg.conf, and a target file. The configuration file
contains general RTG parameters while the target file
contains the list of SNMP targets, SNMP communities,
OIDs, SQL tables and other information. An auxiliary
Perl script included in the RTG distribution, rtg-
targmkr.pl, maintains the target file and ensures that the
interface information in the database is kept up to date.
rtgtargmkr reads a list of devices, fetches each
interface name, description and speed via SNMP, and
maintains database consistency. For instance, the SNMP interface
identifier for a particular interface can change between
network device reboots or after adding a new interface
to a network element. The rtgtargmkr script manages
these changes and detects new interfaces or interface
description changes. The target list is re-read when rtg-
poll traps the UNIX HUP signal, allowing for dynamic
reconfiguration without restarting the daemon. We run
rtgtargmkr periodically via a cron job and then send rtg-
poll a HUP signal so that RTG always maintains an
accurate view of the network.


Each target is SNMP polled at the interval speci-
fied in the rtg.conf file and the result is inserted into a
MySQL [11] database. We chose MySQL because it is
open source, very fast, portable, has a large installed
user base and has multiple application programming
interfaces (APIs), including C, PHP and Perl. While it is
technically feasible to use other databases, for instance
for users with existing database infrastructure, this
would require significant reprogramming of RTG. We
are very pleased with the performance and stability of
MySQL for RTG and have seen little interest in using
other databases. Figure 1 presents a functional dia-
gram of the RTG system.


[Figure 2: RTG database schema. The diagram shows a router table
(including a pop column), an interface table with columns iid (int),
name, speed (int), description (char) and status (bool), and
ifInOctets_xxx/ifOutOctets_xxx tables with columns iid (int),
dtime (datetime) and count (bigint).]


mysql> SELECT * FROM ifInOctets_9 WHERE iid=117
    -> AND dtime>'2002-07-14 18:50' ORDER BY dtime LIMIT 3;
+-----+---------------------+---------+
| iid | dtime               | count   |
+-----+---------------------+---------+
| 117 | 2002-07-14 18:51:16 | 5092874 |
| 117 | 2002-07-14 18:56:23 | 5857165 |
| 117 | 2002-07-14 19:01:25 | 4762324 |
+-----+---------------------+---------+

Listing 1: MySQL query illustrating use of the RTG schema.

RTG can poll either 32- or 64-bit integers and
gracefully handles counter wrap and anomalous val- 
ues. Counter wraps are detected when the SNMP 
result is less than the previous SNMP result from the 
last sample interval for a particular OID. When RTG 
encounters a counter wrap, the database insert value is 
calculated as either

    (2^32 - last_value) + current_value

or

    (2^64 - last_value) + current_value

depending on the OID integer size. Unfortunately, in
practice some counter wraps are not legitimate. Often 
if a device is rebooted between polling intervals, the 
SNMP value returned after the reboot will be less than 
the previous value RTG maintains. RTG eliminates 
these bogus data points by defining a configurable 
out-of-range value above which rtgpoll will never 
attempt an insert into the database. The out-of-range 
value is typically configured as a multiple of the maxi- 
mum number of bytes possible in the defined interval 


on the highest speed link. 
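
The wrap and out-of-range rules above can be expressed as a short function. This is a sketch in Python under stated assumptions rather than the rtgpoll implementation: the function name, the `out_of_range` parameter name and the example threshold (twice the bytes a hypothetical 10 Mbps link can carry in 300 seconds) are illustrative only.

```python
def counter_delta(last_value, current_value, bits=32, out_of_range=None):
    """Return the count to insert for one sample interval, or None to skip it.

    A current value below the previous one is treated as a counter wrap.
    If the resulting delta exceeds `out_of_range`, the sample is assumed
    to be bogus (for example, the device rebooted) and is not inserted.
    """
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    if current_value >= last_value:
        delta = current_value - last_value
    else:
        # Counter wrapped: (2^bits - last_value) + current_value
        delta = (2 ** bits - last_value) + current_value
    if out_of_range is not None and delta > out_of_range:
        return None
    return delta


# Hypothetical threshold: 2 x (10 Mbps / 8 bits per byte x 300 s) bytes.
THRESHOLD = 2 * (10_000_000 // 8) * 300

normal = counter_delta(100, 350)                      # no wrap
wrapped = counter_delta(4294967000, 500)              # legitimate 32-bit wrap
rebooted = counter_delta(1_000_000_000, 1000, out_of_range=THRESHOLD)
wide = counter_delta(2 ** 64 - 10, 5, bits=64)        # 64-bit wrap
```

In the reboot case the apparent wrap yields a delta of more than three billion bytes, far above what the link could have carried, so the sample is dropped rather than stored.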


[Figure 1: RTG system functional diagram. Labeled elements include
the Configuration Parser (rtgtargmkr.pl) and the Target List.]


To meet the long-term storage requirement, RTG 
utilizes the MySQL database in combination with a 
highly efficient database schema. Every effort was 
made to minimize the amount of data that must be 
stored and maximize performance. We observed that 
most reporting and analysis applications are interested 
in router or interface specific data for a particular 
object, such as byte counts, over a time range. To
minimize both the amount of data stored and the amount
any single query must process, and to segment the data
as much as possible, we created a SQL table per unique
device and object. Each table name contains the device
identifier (rid), e.g., ifInOctets_9. In this unconventional
fashion, the table name becomes significant as a unique index.
Each of these tables contains only interface identifica- 
tion (iid), date/time and count columns. The table is 
further indexed by the date/time (dtime) column. Two 
additional tables provide router and interface index 
identifiers (rid and iid) as well as descriptions and 
names. The database schema is illustrated in Figure 2. 


A potential drawback to this schema is that each 
device requires five tables corresponding to five or 
more files on the MySQL system. Because of this, the 
maximum number of devices is bounded by the oper- 
ating system and MySQL’s ability to speedily handle
many files. Despite this limitation, this method has
allowed us to retain more than two years of complete
data for over 100 devices without performance impact;
reports for old data are generated as quickly as reports 
for new data. While different schemas may be more 
appropriate for other installations, this database 
schema provides the highest performance in our net- 
work. It is important to note that RTG does not impose 
any requirement to use this schema. In fact, the RTG 
table names are completely configurable in the target 
list file. For instance, some installations choose to use 
only five tables total corresponding to input and out- 
put octets, packets and errors. 


Using the aforementioned schema for each unique 
network element, the interface identifier (iid), times- 
tamp and difference between the last SNMP sample and 
the current poll are inserted into the network element’s 
unique table. Assume that a user has identified a device
and interface of interest based on the device and inter-
face descriptions in the RTG database. If the device and
interface in question are rtr1.someplace, with a router
identifier (rid) of 9, and interface id (iid) 117, respec-
tively, Listing 1 shows the MySQL query (limited to
the first three rows) to gather ifInOctets data.
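
The per-device table layout and the query in Listing 1 can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module purely as a self-contained stand-in for MySQL, which is an assumption of this example; the column types are SQLite approximations, and the first and last sample rows are made up to show the filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per device and object: the router id (rid) 9 is part of the name.
conn.execute(
    'CREATE TABLE ifInOctets_9 (iid INTEGER, dtime TEXT, "count" INTEGER)'
)
# The table is further indexed by the date/time column.
conn.execute("CREATE INDEX ifInOctets_9_dtime ON ifInOctets_9 (dtime)")

samples = [
    (117, "2002-07-14 18:46:14", 4100000),  # made-up earlier sample
    (117, "2002-07-14 18:51:16", 5092874),
    (117, "2002-07-14 18:56:23", 5857165),
    (117, "2002-07-14 19:01:25", 4762324),
    (118, "2002-07-14 18:51:16", 12345),    # made-up sample, other interface
]
conn.executemany("INSERT INTO ifInOctets_9 VALUES (?, ?, ?)", samples)

# Same shape as the query in Listing 1.
rows = conn.execute(
    'SELECT iid, dtime, "count" FROM ifInOctets_9 '
    "WHERE iid = ? AND dtime > ? ORDER BY dtime LIMIT 3",
    (117, "2002-07-14 18:50"),
).fetchall()
```

Because the device is selected by table name, the query only has to filter on the interface id and the time range.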


Thus, only the minimum necessary data is stored
in the database, preserving speed and storage
space. On one production MySQL server, RTG uses
approximately 5.5 GB of data and 3.9 GB of index
space (9.4 GB in total) to store two years of data.


We do not enforce any data periodicity in the 
database; it is the responsibility of the application to 
determine the total time elapsed between subsequent 
samples for any given table and interface should the 
application need to calculate traffic rates. RTG does 
not record rates, only absolute counts. In the previous
example query, the third sample arrived 302 seconds
after the second, so an application would calculate
the rate as

    4,762,324 octets / 302 sec = 15.8 KBps = 126.2 Kbps.

Finally, RTG’s high performance has the added advan-
tage of allowing sub-minute polling intervals for
instances where high sample granularity is required.
We present a real-world example of the utility of sub- 
minute polling in the RTG Reports section. 
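
The rate calculation an application must perform can be sketched as below, using the last two rows of Listing 1. The helper name `rate_bps` is hypothetical; the only assumptions are that timestamps use the format shown in the listing and that each stored count covers the time since the previous sample.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def rate_bps(prev_dtime, dtime, octets):
    """Average rate in bits per second between two consecutive samples."""
    elapsed = (
        datetime.strptime(dtime, FMT) - datetime.strptime(prev_dtime, FMT)
    ).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return octets * 8 / elapsed


# Second and third rows of Listing 1: 302 seconds apart.
rate = rate_bps("2002-07-14 18:56:23", "2002-07-14 19:01:25", 4762324)
rate_kbps = round(rate / 1000, 1)
```

Computing the elapsed time from the stored timestamps, rather than assuming a nominal 300 seconds, keeps the rate correct when a poll cycle runs slightly early or late.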


Performance Evaluation 


Because performance is a central component of 
RTG, we evaluated RTG, MRTG and Cricket for speed. 
All tests were performed on a dual 360 MHz Sun Ultra 
60 workstation running the Solaris 2.7 operating system. 
We used the UNIX time command to observe the total 
execution time, user CPU and system CPU times for 
each application. We measured the performance five 
times and then took the simple mean of the five test 
runs. Each application’s CPU utilization is presented in 
Table 2. While MRTG and Cricket use significantly
more processor cycles than RTG, these data do not
include the CPU utilization of RTG’s MySQL server.
Nevertheless, MySQL CPU usage under RTG has never been a
practical limitation in production environments, and we
note that it is possible to run the RTG polling daemon
and the MySQL database on separate physical machines
if needed.


The application performance data is presented in 
Table 3. Because in normal operation RTG runs contin- 
uously as a daemon, we modified the code to exit after 
five polling cycles thereby allowing us to use the UNIX 
time utility. The data shows that Cricket with RRDtool 
is far superior to MRTG. Cricket and a non-threaded 
version of RTG are comparable in speed, although RTG 
uses fewer CPU resources. Finally, the multi-threaded 
version of RTG is by far the fastest application in the 
group achieving approximately 107 targets per second 
in our testing. Assuming the traditional five-minute 
sample interval, this yields a theoretical maximum of 
32,000 OIDs monitored on a single RTG system before 
saturation, almost five times as many as Cricket and 
twenty-four times more than MRTG. 
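
The theoretical maxima quoted above follow directly from the measured target counts and run times. The short Python calculation below reproduces them from the figures reported in Table 3; the function and variable names are illustrative.

```python
POLL_INTERVAL_S = 300  # the traditional five-minute sample interval


def max_targets(targets, run_time_s, interval_s=POLL_INTERVAL_S):
    """Targets that fit in one interval at the measured polling speed."""
    return round(targets / run_time_s * interval_s)


# (targets polled, total run time in seconds) as reported in Table 3.
measured = {
    "MRTG": (1618, 365.4),
    "Cricket": (2010, 87.8),
    "RTG-threads": (3650, 34.2),
}
limits = {name: max_targets(t, s) for name, (t, s) in measured.items()}

vs_cricket = limits["RTG-threads"] / limits["Cricket"]
vs_mrtg = limits["RTG-threads"] / limits["MRTG"]
```

The two ratios come out at roughly 4.7 and 24.1, matching the "almost five times" and "twenty-four times" comparisons in the text.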


Whereas with other systems it is not possible to
query even byte statistics for the entire network within the
sample period, the speed of RTG allows us not only to
monitor all devices in the network, but also to monitor
additional objects per interface. For example, we now
monitor the SONET Management Information Base
(MIB) [10] to proactively track transmission problems
and the MPLS MIB [9] to analyze Label Switched Path
(LSP) traffic. Every five minutes, our production RTG
system processes approximately 5000 OIDs in approxi-
mately 60 seconds, leaving ample room for future
growth. We suspect that even higher performance is pos-
sible by increasing the number of threads, and hence the
number of queries in flight, beyond the default of five.
Because the performance is more than acceptable, we
are hesitant to increase the number of threads on our
production RTG for fear of overwhelming the network
or the network devices. We note also that the architec-
ture of RTG easily allows separation of the various
components. For instance, multiple instances of the
RTG poller could be distributed throughout the network
while utilizing separate physical machines for MySQL
and web page or report generation to achieve even
greater scalability.


RTG Reports


Our experience shows that using a SQL database
provides an ideal abstract interface to the data. The
available Perl, PHP or C APIs facilitate rapid prototyp-
ing and allow for the design of highly customized
reports and tools. In our case, the engineering group
currently receives a nightly traffic report including total
bytes, packets, maximum rate and average rate for the
backbone routers via a scheduled Perl DBI script. Each


Application    Targets    User CPU (s)    System CPU (s)    CPU %
MRTG           1618       113.8           19.0              36.1
Cricket        2010       21.3            0.7               25.1
RTG            2255       13              0.7               0.6
RTG-threads    3650       [illegible]     [illegible]       0.8

Table 2: Application CPU utilization.






Application    Targets    Run Time (s)    Sec/Target    Targets/Sec    Max Targets in 5 min
MRTG           1618       365.4           0.23          4.43           1328
Cricket        2010       87.8            0.04          22.89          6868
RTG            2255       77.6            0.03          29.06          8717
RTG-threads    3650       34.2            0.01          106.73         32018

Table 3: Application performance.


Traffic Daily Summary
Period: [01/01/1979 00:00 to 01/01/1979 23:59]

Site              GBytes In   GBytes Out  MaxIn(Mbps)    MaxOut     AvgIn    AvgOut
rtr1.someplace:
  so-5/0/0          384.734      360.857       49.013    43.420    35.630    33.426
  so-6/0/0          357.781      421.736       42.923    50.861    33.137    39.053
  t1-1/0/0            0.054        0.058        0.005     0.006     0.005     0.005
rtr3.someplace:
  so-6/0/0      [illegible]    1,246.163      168.776   172.690   103.173   115.439
  so-3/0/0        1,142.903    1,028.256      152.232   162.402   105.863    95.142
  so-7/0/0          152.824      199.742  [illegible]    35.005    14.152    18.488

Listing 2: RTG summary traffic report.

morning engineers can peruse this email for interesting
events that may require attention. Listing 2 shows 
example output from the Perl traffic summary report 


[Figure 4: RTG plots for different time scales with no loss of
resolution. Two "Network Traffic" plots of ifInOctets_111 and
ifOutOctets_111 in Mbits/s, each annotated with total volume and
max/avg/cur rates: a one-day view and a multi-week view
(04/01 to 04/21).]


included in the RTG distribution. Because no averaging 
is used, absolute numbers such as the total number of 
Gigabytes are shown and the report result for a specific 
time period will be the same regardless of when the 
report is generated. This consistency is invaluable for 
accurate trending and accountability. 


Because the data is stored in a relational database,
it is straightforward to generate a traffic report for a
single customer or a subset of customers for any arbi-
trary time period. Listing 3 depicts output from the 95th
percentile Perl report included with the RTG distribu-
tion. This report shows a customer’s usage on three cir-
cuits including their 95th percentile rate, a metric often
used for billing in the telecommunications industry.


ABC Industries Traffic
Period: [01/01/1979 00:00 to 01/31/1979 23:59]

                           RateIn  RateOut  MaxIn  MaxOut  95% In  95% Out
Connection                 Mbps    Mbps     Mbps   Mbps    Mbps    Mbps
at-1/2/0.111 rtr-1.chi     0.09    0.07     0.65   0.22    0.22    [illegible]
at-1/2/0.113 rtr-1.dca     0.23    0.19     1.66   1.12    0.89    0.57
at-3/2/0.110 rtr-2.bos     0.11    0.16     0.34   0.56    0.26    0.40

Listing 3: Customer 95th percentile traffic report.


Another Perl script we use regularly generates
year-to-date trunk utilization data in comma separated
value (CSV) format, a format that is easily imported
into commercial spreadsheets. These reports are used
by capacity planning groups and presented to upper
management. An example graphic of year-to-date traf-
fic generated by plotting the CSV data in a spreadsheet
is shown in Figure 3.


[Figure 5: Effect of differing sampling rates measuring the same
circuit and time period using 5 minute sampling (top) and 30 second
sampling (bottom). The top plot legend reports a maximum input rate
of 63.4 Mbits/s for ifInOctets_10; the bottom plot legend reports a
maximum input rate of 140.3 Mbits/s for ifInOctets_20.]


RTG includes a set of PHP web pages that pro-
vide a graphical view of circuit utilization for cus-
tomers and support staff for any arbitrary time period.
A key application included in the RTG distribution for
web pages and traffic visualization is rtgplot. rtgplot is
a C program that utilizes the GD library [2] to generate
plots, similar to those generated by MRTG, in PNG for-
mat from the RTG database. rtgplot provides an
extremely fast on-demand graphical interface to the
data. rtgplot can be used as a stand-alone application or
embedded in web pages. A traffic plot can be placed
easily in any web page by simply using the <IMG>
HTML tag with the appropriate arguments. For instance,

    <IMG SRC="rtgplot.cgi?t1=ifInOctets_2&
        t2=ifOutOctets_2&iid=49&begin=1028606400&
        end=1028692800&units=bits/s&factor=8&
        scalex=yes">

will plot two lines of traffic data from the RTG
MySQL tables ifInOctets_2 and ifOutOctets_2 corre-
sponding to interface id 49 on router id 2 for the 24
hour period 1028606400 to 1028692800 (UNIX sec-
onds since the epoch). The ‘factor’ argument multi-
plies the byte data to produce bits per second output on
the plot, while the ‘units’ argument supplies the label
displayed on the vertical axis. The ‘scalex’ argument
automatically adjusts the horizontal time axis according to the avail-
able data samples rather than according to the actual
time span specified. Non-continuous data, such as
errors, are plotted using the ‘impulses=yes’ argument.
Different time views, with no resolution decay, of the 
same circuit monitored by RTG are shown with rtgplot 
output in Figure 4. Note that each plot includes the 
absolute volume of bytes as well as the maximum, 
average and current traffic rates for the time range. 


Finally, the utility of an SNMP tool that is
extremely fast and supports sub-one-minute polling is
underscored by a recent example where a customer
OC-3c circuit (155 Mbps) was experiencing perfor-
mance degradation due to packet loss. A quick examina-
tion of the RTG plot for the circuit did not
reveal any congestion. We then configured RTG to poll
this interface every 30 seconds rather than every five
minutes. Figure 5 shows two plots of this interface over
the same time period. The upper plot is the result from 
polling every five minutes whereas the lower plot is the 
result from polling every 30 seconds. Clearly the five- 
minute polling interval masked the customer’s traffic 
bursts that were causing packet loss. While the plot 
generated from five-minute samples shows a peak input 
rate of 63.4 Mbps, the plot generated from the 30-sec- 
ond samples shows a peak input rate of 140.3 Mbps. 
Data averaging would mask this type of problem even 
further, particularly as the data aged. 
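
The masking effect is easy to reproduce numerically. The Python sketch below uses made-up octet counts (nine quiet 30-second samples at 20 Mbps and one burst at 140 Mbps) to show how the same traffic looks when summed into a single five-minute sample; the function names and the traffic values are assumptions for illustration, not data from the circuit in Figure 5.

```python
def peak_rate_bps(octet_counts, sample_seconds):
    """Peak rate in bits per second, given the octets seen in each sample."""
    return max(octet_counts) * 8 / sample_seconds


def resample(octet_counts, factor):
    """Sum consecutive groups of `factor` samples, as a slower poller would see them."""
    return [
        sum(octet_counts[i:i + factor])
        for i in range(0, len(octet_counts), factor)
    ]


# Hypothetical five-minute window polled every 30 seconds.
QUIET = 75_000_000    # octets in 30 s at 20 Mbps
BURST = 525_000_000   # octets in 30 s at 140 Mbps
fine = [QUIET] * 9 + [BURST]

coarse = resample(fine, 10)             # one five-minute sample
peak_fine = peak_rate_bps(fine, 30)
peak_coarse = peak_rate_bps(coarse, 300)
```

With these numbers the 30-second view peaks at 140 Mbps, close to the capacity of an OC-3c, while the five-minute view of the same octets shows only 32 Mbps.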


Future RTG Development 


We are continuing to develop RTG and improve it 
based on feedback from the open-source community. In 
particular we are looking to increase the robustness of 
RTG by employing a buffering mechanism to buffer 
SNMP results in case the SQL database is down or 
unreachable. In addition, we want to develop function- 
ality by which multiple RTG clients can communicate 
with one another to provide distributed polling and 
redundancy. We recognize that setup and installation of 
RTG is difficult and we plan to improve the configura- 
tion utilities, documentation, etc. Finally, we have 
received feedback about additional uses of RTG includ- 
ing implementing multi-grain storage techniques, as 
opposed to the traditional fixed sample interval, to
isolate interesting variations in the data. This could
potentially lead to the development of a denial-of-ser- 
vice detection or fault management system. 


RTG Availability 


RTG is developed on Solaris, tested on FreeBSD 
and Linux, and should run on a wide variety of other 
UNIX platforms by virtue of a GNU autoconf script. 
There is no support for Windows platforms. RTG is
available under the terms of the GNU GPL from the
RTG web page hosted on SourceForge,
http://rtg.sourceforge.net. Further information, including
documentation and mailing lists, can be found on the RTG
home page.
ABSTRACT


Maintaining configurations in heterogeneous networks poses complex problems. We observe 
that medium and large networks exhibit many contextual relationships, and argue that modeling 
these relationships explicitly simplifies configuration management. This paper presents a 
declarative data specification language, called Anomaly, that implements our ideas. Anomaly 
models containment relationships and uses a data aggregation technique called environmental 
acquisition to simplify system management. The interpreter for the language generates and 
deploys configurations from a source code description of the network and its hosts. 


Introduction 


Configuration management is an important task 
for system administrators, as it is often one of the 
largest portions of their job. In the last decade, a num- 
ber of configuration management systems have 
emerged to help with this task. Some of this work is 
summarized in a paper by Remy Evard [1]. 


Our paper presents an attempt to take some of 
the best features of previous solutions and use them in 
a framework based on environmental acquisition, an 
abstraction mechanism borrowed from the object-ori- 
ented systems community. The result is a declarative 
language, called Anomaly. Anomaly is intended
to be useful in practice as well as in theory, and it
provides a plug-in interface for adding extensions.


Previous Solutions 


A survey of previous solutions to the problem of 
generating and managing configurations reveals three 
particularly important concepts: data aggregation tech- 
niques such as inheritance and class systems; logic 
programming techniques that reduce the complexity of 
configuration statements; and database-driven systems 
that store configuration data in a repository. 


Cfengine [4] is one widely used solution. It is a 
language-based host configuration tool that uses a 
class system to aggregate configuration commands. 
Cfengine’s class system allows an administrator to 
apply configuration statements to a class of machines. 
It uses techniques similar to logic programming to 
make its statements more concise. Couch and Gilfix 
demonstrated this in their 1999 paper, “It’s Elemen- 
tary Dear Watson...” [5]. Cfengine is not perfect, 
however. Its host-centricity does not lend itself well to 
other components of the environment, especially 
switches and routers. 


Language-based configuration tools are an 
important contribution, but for large networks, config- 
uration data can become unwieldy when stored in
source code form. Database-driven approaches [2, 3]
solve this problem by providing a repository for con- 
figuration data. This approach simplifies the mainte- 
nance of configurations and reduces the rate of errors 
by reducing the amount of source code that adminis- 
trators have to deal with. 


The paper by Couch, et al., [5] also presents an 
approach that leverages convergent processes based on 
logic programming (Prolog). The authors point out 
that previous approaches, especially Cfengine, already 
use convergent processes. An example of this is the 
Cfengine link command, which looks like: 


links: 
/etc/sendmail.cf ->! mail/sendmail.cf 


In Cfengine, the link command (like most other 
commands) hides a great deal of housekeeping from 
the administrator. The link command above takes care 
of checking whether /etc/sendmail.cf already exists as a 
link to mail/sendmail.cf, or if it already exists as a non- 
link file, and takes appropriate actions to make 
/etc/sendmail.cf a link to mail/sendmail.cf. To do the same 
in Bourne shell might take a dozen lines. 


This process is called a convergent process 
because executing the link statement multiple times 
does not have side-effects after the first execution (i.e., 
it is idempotent). Also, the link statement behaves 
appropriately regardless of the initial state of the sys- 
tem. Therefore, the state of the system converges to 
the ideal state described by the source. 
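
To make the idea of a convergent operation concrete, the following Python sketch implements a link operation that can be run repeatedly from any starting state. It is an illustration only and does not reproduce Cfengine's implementation: the function name `ensure_link` and the choice to move a conflicting regular file aside with a `.saved` suffix are assumptions of this example.

```python
import os
import tempfile


def ensure_link(link_path, target):
    """Make link_path a symlink to target, whatever its current state.

    Returns True if the filesystem was changed, False if it had already
    converged to the desired state.
    """
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False                  # already correct: nothing to do
        os.remove(link_path)              # link to the wrong place: replace it
    elif os.path.exists(link_path):
        os.rename(link_path, link_path + ".saved")  # non-link file: move aside
    os.symlink(target, link_path)
    return True


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "mail-sendmail.cf")
    link = os.path.join(root, "sendmail.cf")
    open(target, "w").close()
    with open(link, "w") as fh:           # start from a "wrong" state
        fh.write("old config\n")

    first = ensure_link(link, target)     # performs the change
    second = ensure_link(link, target)    # finds nothing left to do
    converged = os.readlink(link) == target
```

The second call returns without touching the filesystem, which is the idempotence property described above.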


Environmental Acquisition 


The fundamental insight of this paper is that all 
network objects exist in contexts. The idea of context 
dependency is both powerful and pervasive. Every 
computer, every switch, every printer in a network 
exists in a context that affects its desired behavior. In 
other words, the physical and logical location of a net- 
work object determines properties of that object. Here 
are some concrete examples of network objects and 
their context dependencies:

• Access Controls. A host in a computing lab
  must have less restrictive access controls than a
  host in a data center. Here, the physical location
  of the host is part of its context.

• IP Configuration. The netmask and default
  route of a host depend on the subnet to which
  the host belongs. Here, the subnet is the context
  of the host.

• Switch Configuration. Individual ports on net-
  work switches often require their VLAN mem-
  bership to be set by an administrator. In net-
  works where each subnet uses a separate
  VLAN, a switch port may depend on its con-
  text, i.e., the subnet to which it belongs, to
  determine its VLAN membership.

• Printer Selection. Each computer lab or office
  in a building may have its own printer. This
  printer then becomes part of the context of each
  host in the room. Put simply, the most sensible
  thing for a host to do by default is to use a
  printer that has been assigned to the room in
  which the host is located. This can be accom-
  plished by looking for a default printer in the
  physical context of each host.


System administrators use this contextual infor- 
mation almost every time they configure a network 
element such as a host, switch, or printer. To our 
knowledge, no tool exists to model this contextual 
information and automatically generate configurations 
based on it.¹ Therefore, the goal of Anomaly is
twofold. First, Anomaly should encourage administra- 
tors to think actively about the contextual relationships 
in their network. Second, Anomaly should be a tool 
for modeling these relationships explicitly, in order to 
simplify the maintenance of configurations. 


One method of modeling contextual information 
is to treat contexts as containers and construct a set of 
containment relationships regarding network objects. 
Using this idea, one notices relationships such as “the
machine chorf is in room 201,” and “the machine
ambler belongs to subnet 192.168.7.0/24.” From here,
it is easy to observe that objects acquire properties 
from containers. 


The name for this process is environmental 
acquisition [8]. Environmental acquisition, or simply 
acquisition, is an analogue of inheritance that operates 
on object/container relationships, rather than
class/sub-class relationships.2 The following example,
paraphrased from the original paper on acquisition [8], 
illustrates the idea perfectly. 


1Cfengine does have a notion of context, in that it condi- 
tions its actions based on probes of the filesystem and oper- 
ating system. However, it does not emphasize modeling con- 
textual relationships, nor does it propose a single method of 
examining context, such as the construction of containment 
graphs. 

2The Zope application server (http://www.zope.org/), the 
most prominent use of environmental acquisition, uses envi- 
ronmental acquisition to build and manage web content. 
Consider a red car. If someone asks you about
the color of the car’s hood, you will certainly tell 
them that the hood is red, unless you know other- 
wise. In this situation, the hood has acquired its 
color from the car. However, it would be wrong 
to model this relationship using inheritance, 
because a hood is not a type of car. A hood is a 
part of a car, that is, car and hood have an 
object/container relationship across which prop- 
erties are acquired. 


Through the use of acquisition, administrators 
can build models in which hosts acquire their access 
restrictions from their physical location, and their
default routes from the subnet they reside in. 


Many previous solutions use inheritance-like 
mechanisms for data aggregation. Although inheri- 
tance has worked well in previous approaches, it is not 
the most natural or useful data aggregation mechanism 
available. 


First, the purpose of inheritance is not to aggre- 
gate data. At best, it can be considered a feature (or 
behavior) aggregation technique. Moreover, inheri- 
tance is used to establish is-a relationships. In the 
approaches discussed above, this rule is bent slightly 
(to great benefit, of course). For example, Cfengine’s 
class system uses boolean set operations (logical AND 
and OR) to decide to which machines a given action 
should be applied. The following statement says “link 
/etc/passwd-link to /etc/passwd on all machines that are 
in classes solaris and guest.” 

solaris.guest:: # logical AND 
/etc/passwd-link -> /etc/passwd 


This approach is loosely based around the idea of 
inheritance. Objects are instances of specific classes, 
and the class which an object belongs to determines its 
behavior. However, inheritance is not an appropriate 
technique for data aggregation in system management. 
Acquisition is a better approach to data aggregation 
for the following two reasons. 


First, acquisition encourages system administrators 
to think about containment and contextual relationships. 
This is an improvement over inheritance, which encour- 
ages thinking in terms of is-a relationships. While it may 
be true that the host chorf is-a Solaris host, chorf also has 
interesting relationships with its surroundings. It is in a 
room, and it is part-of a subnet, but neither of these
relationships lends itself to inheritance.


Second, acquisition has finer granularity than 
inheritance. Inheritance demands that a child class 
inherit all the features of its parent. A child class may 
choose to override certain features that it inherits, but it 
may not decline them outright. This is necessary because 
inheritance imposes sub-type relationships. A child class 
must have all the behavior of its parent, or else the is-a 
relationship is nullified. Because acquisition does not 
impose sub-typing, an object may pick and choose which 
features to acquire from its container. This prevents
objects from acquiring unexpected properties. 


There is at least one other twist on inheritance 
that bears mentioning, namely value inheritance. 
Value inheritance is a data aggregation technique used 
in prototype systems such as ARK [6]. It works by 
making copies of a prototype object, and then making 
tweaks to the values inherited from the prototype. 
Most of the examples presented in this paper could 
also be implemented using value inheritance. We 
believe, however, that environmental acquisition is the 
better choice for the reasons discussed above. It 
encourages thinking in terms of contextual relation- 
ships, and it provides better granularity than value 
inheritance. David Ungar’s paper on SELF [10] dis- 
cusses prototyping and value inheritance in depth. 


Logic Programming 


Anomaly does not have the power of a full logic 
programming language such as Prolog. Cfengine’s
logic programming abilities also exceed those of
Anomaly. In Anomaly, the ability to specify the
targets of configuration statements with ‘facts’ is
replaced by the approach of context modeling and 
environmental acquisition. However, Anomaly keeps 
what we believe is the most important contribution of 
Cfengine, namely, the idea of convergent processes 
specified by logical statements. Therefore, rather than
using imperative statements such as “add this user” or
“put this interface in promiscuous mode,” Anomaly
uses logical assertions such as “this user must exist”
or “this interface must be in promiscuous mode.”


Anomaly 


Our language, Anomaly, combines important ele- 
ments of prior approaches with environmental acquisi- 
tion. Anomaly has constructs to model containment 
relationships (i.e., which objects are contained in 
which objects), and constructs for describing the ideal 
state of individual objects. 


Examples 


The following examples illustrate the use of 
Anomaly and highlight its strengths, by examining 
several typical system administration scenarios. Sev- 
eral different types of contextual relationships are pre- 
sented, in order to demonstrate the benefits of explic- 
itly modeling context. 


Host Access Controls 


Configuration of host access controls is the most 
basic and obvious use of acquisition in system man- 
agement. A system administrator can make access 


3Acquisition comes in two flavors, implicit and explicit. 
When using implicit acquisition, any attempt to access a 
variable that is missing from an object results in an attempt 
to acquire that variable. When using explicit acquisition, no 
attempt is made to acquire variables unless they are listed 
explicitly as candidates for acquisition. Anomaly uses ex- 
plicit acquisition. 
controls more maintainable by exploiting a contextual
relationship in the environment. In a university setting, 
it may be appropriate to use the physical location of a 
host as the context. In a corporate setting, the appro- 
priate context may be the ownership of the machines 
(e.g., departmental, individual). In this example, we 
use the physical location as the context. 


The first consideration is the containment graph. 
It is specified in Figure 1. Figure 2 defines some of 
these objects. The object CullinaneHall is defined as a 
Building, and is given a default access policy that 
allows only users in the systems netgroup to log in. 
This means that any host in the building acquires a 
restrictive access policy, unless otherwise specified. 





CullinaneHall contains {
    UnixLab;
    DeansOffice;
}

UnixLab contains {
    chorf;
    wharf;
    staypuff;
}

DeansOffice contains {
    deans-laptop;
}

Building CullinaneHall {
    AccessPolicy access;

    access.allows(systems);
}

Room UnixLab {
    AccessPolicy access;

    access.allows(students);
}

SolarisHost chorf {
    AccessPolicy access('/etc/passwd');

    acquire access;
}

Figure 1: Controlling access policies in object definitions.


For the UnixLab object, that is exactly what is 
done. Since the UnixLab object represents a public 
computer lab, it must have an access policy that 
allows users from the students netgroup to log in. 
Therefore, any host placed in the UnixLab object 
acquires a permissive access policy, specifically one 
that allows students to use the computers. 


From this example, it is easy to see the benefits 
of using this method across an entire environment. 
When every room in a building has a sensible access 
policy assigned to it, administrators hardly have to 
worry about individual hosts. This directly addresses 
the recurring problem of hosts moving from faculty 
desks to public labs (or from similar restricted access
locations to similar public access locations). If no one
remembers to edit /etc/passwd, the result is that students
cannot log into the machine.


The most important benefit of this scheme is that 
even if a forgetful administrator moves a machine 
from one place to another without consideration for its 
access policy, the machine acquires a sensible access 
policy from its new environment. Modeling the con- 
text of the machine has thus made the path of least 
work more likely to be the correct path. In short, the 
lazy way produces the best defaults. 
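The effect of moving a host can be sketched in a few lines of Python. This is only an illustration of the idea, not Anomaly's implementation: the Node class and its methods are hypothetical, and the sketch assumes each object sits in a single chain of containers, so conflicts between containers are ignored here.

```python
class Node:
    """A named object with local fields and a list of containers (hypothetical)."""

    def __init__(self, name, **fields):
        self.name = name
        self.fields = dict(fields)
        self.containers = []

    def contains(self, child):
        child.containers.append(self)

    def remove(self, child):
        child.containers.remove(self)

    def acquire(self, field):
        """Return the nearest value for `field`, searching outward."""
        if field in self.fields:
            return self.fields[field]
        for container in self.containers:
            try:
                return container.acquire(field)
            except KeyError:
                pass
        raise KeyError(field)


building = Node("CullinaneHall", access="systems")
lab = Node("UnixLab", access="students")
office = Node("DeansOffice")          # declares no policy of its own
building.contains(lab)
building.contains(office)

chorf = Node("chorf")
office.contains(chorf)
before = chorf.acquire("access")      # found in the building, via the office

office.remove(chorf)                  # the host is carried to the lab
lab.contains(chorf)
after = chorf.acquire("access")       # now found in the lab
```

The host's own declaration never changes; only its place in the containment graph does.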


SwitchPort zaphod-8-13 {
    SwitchNum switchnum(switch = 'zaphod.ccs.neu.edu');

    switchnum.is('8/13');

    acquire vlan;
    acquire ether;
}

SolarisHost chorf {
    MAC ether;

    ether.is('01:1C:ED:C0:FF:EE');
}

Subnet UnixSubnet {
    NetworkAddr net;
    Netmask mask;
    VLAN vlan;

    net.is('10.10.2116');
    mask.is('255.255.254.0');
    vlan.isnamed('116');
}

UnixSubnet contains {
    chorf;
}

chorf contains {
    zaphod-8-13;
}

Figure 2: Modeling a switch port.





Switch Port Configuration 


Switch port configuration often involves config- 
uring VLAN membership and port security settings. 
The contextual relationship is not quite as obvious as 
in the previous example. Here, it is necessary to treat 
the switch port as an object contained in the host that 
is attached to it. From a physical standpoint, this 
seems inappropriate. However, from a logical stand- 
point it is easier to think of the host being part of the 
port’s context. When using MAC-based port security, 
it is necessary to know which host the switch port is 
supposed to serve. 


Figure 2 shows the containment and object dec- 
larations for chorf and its switch port. When the con- 
figuration in Figure 2 is built, the switch port acquires 
two fields: vlan and ether. The vlan field allows the
switch port to set its VLAN membership properly, and
the ether field allows it to set its port security settings
properly. After the configuration is deployed, the 
switch will be usable only by chorf. 


This figure also raises a question about the dif- 
ference between the ‘is’ assertion and the ‘isnamed’ 
assertion. The former is used when an assertion 
describes the state of the variable. In this example, the 
ether field is completely specified by a 48 bit address, 
that is, the ether field is its address. The latter asser- 
tion, isnamed, is used when referring only to the identi- 
fier of a field. The vlan field in this example is only 
concerned with the identifier of the VLAN, not with 
configuration of that VLAN on the switch. 


Beta-Test Environments 


When upgrading subsystems such as daemons, 
kernels, or user applications, it is best to test the new 
software on a small subset of machines. It is possible 
to test upgrades on a single machine near the adminis- 
trators’ offices, but a single machine may not be repre- 
sentative of the rest of the environment (especially if 
host hardware is substantially varied throughout the 
environment). Furthermore, a designated test machine 
is not likely to have the same usage patterns as most 
machines in the environment. Anomaly can assist with 
this problem by providing a better method for organiz- 
ing beta test systems, using environmental acquisition. 


Platform Solaris {
    Packages packages;
    Patches patches;

    patches.has('sun-recommended');
    packages.has('openssh-3.0.2p1');
    packages.has('lprng-3.6.14');
}

Platform Beta {
    Packages packages;
    Patches patches;

    patches.has('sun-recommended');
    patches.has('108604-18');
    packages.has('openssh-3.1p1');
}

Figure 3: Platform objects.


The containment relationships described in Fig- 
ures 3 and 4 simplify beta testing. In order to test a 
new Solaris patch (108604-18), we add it to the Beta 
container. This causes the patch to be installed on all 
machines contained in Beta. When the patch is verified 
to work properly on all Beta machines, the patch can 
be moved from the Beta container up to the Solaris 
container, which causes it to be installed on all 
machines. 


This approach offers several advantages: 

• Because Beta is contained in Solaris, all other fields in
  the Solaris container are acquired by the machines in Beta.
  Therefore, the configurations of these machines will stay as
  close as possible to the standard configuration being used by
  the rest of the environment.

• Because all hosts for testing the new configuration are
  aggregated into one container, it is easy to keep track of
  which machines are being used as test cases.





Solaris contains {
    north-star;
    dog-star;
    ## and many, many others

    Beta;
}

Beta contains {
    ## machines of various
    ## hardware configurations
    chorf;
    emerald-city;
}

Figure 4: Beta test containment.





EthernetInterface Interface {
    acquire mode;     # declared in network
    acquire ethaddr;  # declared in hosts
}

SwitchPort switchport {
    SwitchNum switchnum(switch = 'zaphod.ccs.neu.edu');

    switchnum.is('8/13');

    acquire vlan;
    acquire ether;
    acquire smode;
}

Network CCSNetwork {
    IfMode mode;
    SwitchPortMode smode;
    mode.is('normal');
    smode.is('normal');
}

Figure 5: Interface declaration.





• The beta test containment relationship is orthogonal to
  other containment relationships such as physical location,
  network (logical) location, and ownership. Therefore, the
  inclusion of a machine into the beta test container does not
  affect the usage patterns of the machine.

• When no upgrades are being tested, the machines in the beta
  container acquire exactly the same settings that machines
  outside the beta container acquire. Therefore, the addition
  of the beta container does not affect the environment in any
  way when it is not being used. In other words, the container
  becomes completely transparent when nothing is being tested.


While it might seem that the ‘has’ assertion in 
this example must embody all of the functionality of 
Cfengine, it is not nearly that complex. It uses a
directory that contains all the Solaris patches currently in
use in the environment. In it there is a subdirectory 
named ‘2.8 recommended.’ Other patches are listed 
individually. The has assertion simply checks if a 
patch is installed, and if it isn’t, runs the patch’s install 
script. The packages field works similarly. This 
method, of course, cannot handle all of the boundary 
cases handled by Cfengine. 
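The check-then-install behavior described above can be sketched as a small idempotent function. This is a hedged illustration only: the function name has, the installed set, and the install callback are assumptions standing in for the real patch directory and install scripts, which the paper does not show.

```python
def has(installed, name, install):
    """Ensure `name` is present; run `install` only when it is missing.

    `installed` is a set of names already on the host, and `install`
    is a callable standing in for the patch's install script.
    Returns True if an installation was performed.
    """
    if name in installed:
        return False
    install(name)
    installed.add(name)
    return True


installed = {"sun-recommended"}
log = []                                   # records which install scripts ran
first = has(installed, "108604-18", log.append)
second = has(installed, "108604-18", log.append)   # already satisfied: no-op
```

Because the second call does nothing, the assertion can be re-evaluated on every deployment without side effects.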


Reparenting 


Modeling context through containment is the 
central theme of Anomaly. The following example 
demonstrates how changes in context and containment 
result in appropriate changes in the network, with a 
minimum of reconfiguration. 


Consider the containment graph in Figure 6. The 
object declarations for individual hosts and unman- 
aged containers are not shown, but are similar to the 
declarations used in Figures 8 and 1. The one unfamil- 
iar object in this graph is Interface, which represents 
Ethernet interfaces on Unix hosts. Its declaration is 
shown in Figure 5; it is placed into its containers with 
copy containment. Figure 5 also shows the declaration 
of CCSNetwork, which specifies a mode (promiscuous 
or normal) which is acquired by all interfaces. 


Suppose that the administrator of this network 
wishes to start running host-based IDS software. This 
can be accomplished without modifying the declara- 
tions of individual hosts or interfaces at all. The 
administrator simply adds an IDS container and repar- 
ents the appropriate objects as shown in Figure 7. 
Now, all IDS-related settings are aggregated into a sin- 
gle container. To make a machine an IDS machine, it 
needs only to be added to the IDS container. By 
acquiring settings from the IDS container, the machine 
receives the IDS packages, the machine’s interface 
becomes promiscuous, and the switch port attached to 
the interface is put into spanning mode. 


By modeling context through containment, we are 
able to aggregate control of three network components 
(host software, host interface, and switch port) into a 
single point of control. Examples such as this one make 
changes to configurations atomic (i.e., the desired 
change can be effected by moving one machine into one 
container), and thus more maintainable. 


The Language 


Anomaly is a simple object-oriented declarative 
language. Its basic components are: 
• Objects. Objects in Anomaly are used to model components of
  a network. They can be divided into two categories:

    • Managed Objects. Managed objects represent elements of
      the network that are directly managed by system
      administrators, e.g., computers, switches, and printers.

    • Unmanaged Objects. Unmanaged objects are components of
      the network that are not directly managed as part of the
      network, e.g., buildings, rooms, and research groups.

A simple Anomaly object appears in Figure 8. 





# unix platform 
Unix contains { 
Linux; 
Solaris; 
OpenBSD; 

} 


Linux contains { 
darkside; 

} 

OpenBSD contains { 
runningboar; 

} 

Solaris contains { 
chorf; 
ambler; 
north-star; 


} 


# a building 
Cullinane contains { 


UnixLab; 
MrRoom; # machine room 


} 


CCSNetwork contains { 
Subnet115; 
Subnet116; 
Subnet118; 

} 


Subnet115 contains { 
runningboar; # a host 
} 


Subnet116 contains { 
chorf; 
ambler; 
north-star; 


} 


Subnet118 contains { 
darkside; 
} 


ContainmentTemplate ( 
QUERY = "SELECT hostname, \ 
switchport FROM hosts;" 
NAME = hostname; 
) contains { 
copy Interface contains { 
QUERY.switchport; 
} 
} 


Figure 6: Containment relationships in a small network.





• Fields. Fields store the configuration data for objects. In
  Figure 8, fields include hostname, ip, and accesspolicy.

• Parameters. Fields may be parameterized. In Figure 8, the
  field accesspolicy specifies that /etc/passwd is the location
  of the file it generates. The purpose of parameters is to
  tell Anomaly where data must go. Another example of a
  parameter is the name of a network interface object, e.g.,
  hme0 or eth1.

• Assertions. In Figure 8, ip = '129.10.117.177'; is an
  assertion about the value of the field ip. It says: “The
  value of ip is 129.10.117.177.” Assertions are implemented by
  the fields that use them, not by the core of Anomaly.
  Therefore, two different types of fields (e.g., AccessPolicy,
  Packages) may have completely different implementations of
  the ‘has’ assertion.





Platform IDS {
    IfMode mode;
    SwitchPortMode smode;

    packages.has('snort');
    packages.has('acid');
    mode.is('promiscuous');
    smode.is('spanning');

    acquire packages;
}

CCSNetwork contains {
    IDS;
}

OpenBSD contains {
    IDS;
}

IDS contains {
    runningboar;
}

Figure 7: IDS declarations.





SolarisHost chorf {
    Fqdn hostname;
    IPaddr ip;
    Access accesspolicy('/etc/passwd');
    ip = '129.10.117.177';
    acquire resolv;
    acquire accesspolicy;
    acquire mounts;
}

Figure 8: Example object.





• Acquisitions. In Anomaly, the keyword acquire signifies the
  acquisition of a field. Acquisitions tell Anomaly to search
  for the value of that field in the containers of an object.
  It is important to note that accesspolicy is an acquired
  field, yet it has a parameterized declaration. In this
  example, accesspolicy acquires its value from its containers,
  but retains the parameters specified by its object. The other
  acquired fields in this example result in the acquisition of
  the value and the parameters from the containers, since no
  declaration is specified in the object.

• Containment declarations. A simple containment declaration
  appears in Figure 9. The three blocks in the example declare
  that chorf and ambler are part of Subnet116, chorf is in
  Room201, and ambler is in Room129.

• Database templates. While language-based configuration
  approaches have many merits, storing configurations for
  hundreds of machines in source code form can be unwieldy.
  Database templates are provided in order to integrate Anomaly
  with configuration database back-ends, by allowing for the
  creation of arbitrarily many objects in only a few lines of
  code.


Subnet116 contains {
    chorf;
    ambler;
}

Room201 contains {
    chorf;
}

Room129 contains {
    ambler;
}

Figure 9: Example containers.


Semantics of Containment 


Objects in Anomaly may have multiple, non- 
nested containers. In fact, most do. For example, a 
host usually belongs (at minimum) to both a room and 
a subnet at the same time. Yet, the subnet is not in the 
room, and the room is not in the subnet. 


During the design process, we considered requir- 
ing that the containment graph take the form of a for- 
est whose trees meet only at the leaves. This decision 
turned out to be unduly restrictive. Therefore, the only 
requirement Anomaly places on the containment 
graph is that it be directed and acyclic, that is, objects 
may not contain themselves, directly or indirectly. 
Users of Anomaly can construct complex containment 
graphs using this rule. 
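The acyclicity requirement can be checked with a standard depth-first search. The sketch below is an assumption about how such a check might look, not a description of Anomaly's compiler; the dictionary representation of containment (container name mapped to the names it contains) is hypothetical.

```python
def has_cycle(contains):
    """Return True if some object contains itself, directly or indirectly.

    `contains` maps a container name to the names it contains.
    """
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for child in contains.get(node, ()):
            state = color.get(child, WHITE)
            if state == GRAY:         # back edge: child is one of our ancestors
                return True
            if state == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(contains))


# chorf has two non-nested containers, which is allowed.
valid = {"CCSNetwork": ["Subnet116"], "Subnet116": ["chorf"], "Room201": ["chorf"]}
# A and B contain each other, which is not.
invalid = {"A": ["B"], "B": ["A"]}
```

Note that multiple containers for one object, as with chorf here, do not count as a cycle.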


When attempting to model certain containment 
relationships in this way, there are situations in which 
one would like to use the same object in many loca- 
tions. For example, if network interfaces are repre- 
sented by objects, they will most likely be identical 
across a wide group of machines. However, we cannot 
simply declare one interface object, and place it in 
each host. Doing so would create a containment graph 
like the one depicted in the left half of Figure 10, 
where the objects in the top layer all contain the same 
interface object. When the interface object attempts to
acquire its netmask, it finds a netmask field in each of
its parents, and compilation fails.

To remedy this situation, Anomaly offers copy
containment. Copy containment reduces the number of
duplicate object declarations by allowing a single
declaration to be used in any number of places. When a
containment relationship is declared as copy containment,
the contained object is cloned, and the copy is
placed into the container. This results in the containment
graph in the right half of Figure 10, where the
gray objects are all clones of an original object. This
approach was inspired by mixins, an alternative
approach to multiple inheritance [9].




Figure 10: A simple containment graph with and 
without copy containment. 


Semantics of Acquisition 

Anomaly uses references to represent contain- 
ment links between containee and container. When an 
object explicitly acquires a field, using the acquire 
statement, Anomaly searches for the field in the cur- 
rent object. If the field is not found, the containers of 
the object are searched recursively, until the field is 
found. 


Because objects may have multiple containers, 
Anomaly must account for acquisition from multiple 
containers. If the acquisition search reveals that a vari- 
able can acquire its value from two different contain- 
ers, Anomaly reports an error. This condition is called 
an acquisition conflict, because two equally valid con- 
texts have been found. 
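The search and the conflict rule can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the function names, and the printer field values are hypothetical, and the sketch assumes that a conflict means the search reaches more than one distinct declaring object.

```python
class AcquisitionConflict(Exception):
    """Raised when a field can be acquired from more than one container."""


def sources(obj, field, fields, containers):
    """Return the nearest objects, along every path, that declare `field`."""
    if field in fields.get(obj, {}):
        return {obj}                  # found here: stop searching this path
    found = set()
    for parent in containers.get(obj, ()):
        found |= sources(parent, field, fields, containers)
    return found


def acquire(obj, field, fields, containers):
    found = sources(obj, field, fields, containers)
    if not found:
        raise KeyError(field)
    if len(found) > 1:
        raise AcquisitionConflict(f"{obj}.{field}: {sorted(found)}")
    (owner,) = found
    return fields[owner][field]


containers = {"chorf": ["UnixLab", "Subnet116"], "UnixLab": ["CullinaneHall"]}
fields = {
    "CullinaneHall": {"access": "systems"},
    "UnixLab": {"access": "students", "printer": "lab-printer"},
    "Subnet116": {"vlan": "116", "printer": "net-printer"},
}
access = acquire("chorf", "access", fields, containers)   # nearest: UnixLab
vlan = acquire("chorf", "vlan", fields, containers)       # only in Subnet116
```

Acquiring printer for chorf raises AcquisitionConflict in this sketch, because both the room and the subnet offer a value.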


As Anomaly attempts to resolve an acquisition, it 
may find that an object’s container has attempted to 
acquire the same field. When this scenario occurs, the 
acquisition in the container is resolved first, and then 
the acquisition in the containee is resolved, using the 
value that has now been inserted into the container. 


While this intermediate step looks superfluous at 
first glance, it is necessary to prevent ambiguities 
about the origin of a field’s parameters. Recall that a 
field may specify its parameters, but acquire its value 
from the container. Consider three objects that are 
contained one within the other, like Russian dolls. 
Suppose that the outer object declares a field called 
accesspolicy, and that the inner two objects acquire that 
field. In the case where the outer object and the middle 
object specify different parameters (e.g., /etc/passwd 
and /etc/shadow), the semantics described in the above 
paragraph force the innermost object to acquire its 
parameters from the middle object rather than the 
outer object. In other words, this rule resolves any 
ambiguity about where a field’s parameters are 
acquired from. 
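The Russian-doll case can be traced with a short sketch. It is only an illustration of the ordering rule described above, assuming a single chain of nested objects; the tuple layout and the value "students" are hypothetical.

```python
def resolve_chain(chain):
    """Resolve one field along a chain of nested objects, outermost first.

    Each entry is (name, own_params, own_value); None means the object
    does not specify that part and acquires it instead. Containers are
    resolved before their containees.
    """
    resolved = {}
    params, value = None, None
    for name, own_params, own_value in chain:
        if own_params is not None:
            params = own_params       # an object keeps parameters it declares
        if own_value is not None:
            value = own_value
        resolved[name] = (params, value)
    return resolved


chain = [
    ("outer", "/etc/passwd", "students"),   # declares the field
    ("middle", "/etc/shadow", None),        # acquires value, sets own parameter
    ("inner", None, None),                  # acquires both
]
result = resolve_chain(chain)
```

Here the innermost object ends up with the parameter of the middle object and the value of the outer one.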


The Dependency Graph 


In large configurations, it is both time consuming 
and inconvenient to re-deploy every configuration on 
the network when only a few objects have been modi- 
fied. Therefore, Anomaly uses a simple strategy to 
deploy configurations only to those objects that may 
have changed since the last update. This strategy 
exploits the structure of the containment graph. 




The containment graph, in addition to modeling 
contextual relationships, also models dependency rela- 
tionships. Using acquisition, a change to the configu- 
ration of one object can affect only that object, and 
any objects contained within. It is not possible for 
changes in an object to have any effect on its contain- 
ers. To exploit this invariant, Anomaly keeps track of 
which objects have been altered since the last success- 
ful attempt to deploy configurations. When another 
attempt is made to deploy the configurations, 
Anomaly discards all objects that have not been modi- 
fied, and are not contained within objects that have 
been modified.4 


Figure 11 illustrates a modification to the access 
policy of a computing lab. The left half shows the net- 
work before the change is made; the right half shows 
the network after the change is made. Only objects in 
the shaded area require re-deployments of their con- 
figurations. The shaded area is computed by a simple 
mark and sweep algorithm. 
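The marking step can be sketched as a breadth-first traversal from the modified objects. This is an assumed reading of the strategy, not Anomaly's code; the graph representation is hypothetical, and placing north-star in MrRoom is made up for the example.

```python
from collections import deque


def needs_redeploy(contains, modified):
    """Return the modified objects plus everything contained within them.

    `contains` maps a container name to the names it contains.
    Objects outside the returned set can be discarded for this run.
    """
    marked = set(modified)
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for child in contains.get(node, ()):
            if child not in marked:
                marked.add(child)
                queue.append(child)
    return marked


contains = {
    "Cullinane": ["UnixLab", "MrRoom"],
    "UnixLab": ["chorf", "wharf"],
    "MrRoom": ["north-star"],
}
dirty = needs_redeploy(contains, {"UnixLab"})   # the lab's policy was edited
```

The building and the machine room are left untouched, since changes never propagate to containers or siblings.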


Checking Types 


Anomaly enforces some type constraints during 
compilation. In particular, it checks assertions for type 
correctness in the obvious manner. For example, if we 
declare that some variable represents an IP address 
and make an assertion about the variable, then 
Anomaly ensures that the assertion associates a well- 
formed IP number with the variable. Anomaly does 
not, however, attempt to enforce constraints at run- 
time. That is, for all those actions for which it cannot 
check type constraints during compilation, the config- 
uration transport mechanisms must enforce the coher- 
ence of the data access operations (e.g., file access, 
SNMP set commands). In practice, this means that if a 
type constraint is violated while Anomaly is generat- 
ing configurations (i.e., after the source code has been 


4It is also possible to construct an acquisition graph from
the containment graph, in which each edge represents the ac- 
quisition of some value. Using this graph, the minimal set of 
modified machines can be computed. Currently, Anomaly 
does not compute this minimal modified set. 
processed), the configuration will either fail or
produce incorrect results. Making Anomaly truly type
safe — indeed, exploring what type safety precisely 
means in this context — is future research. 
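A compile-time check of an IP address assertion might look like the sketch below. It uses Python's standard ipaddress module as a stand-in validator; the function name and the choice to report failures as TypeError are assumptions, and the paper does not say how Anomaly performs the check.

```python
import ipaddress


def check_ip_assertion(field_name, literal):
    """Validate the literal in an assertion on an IP address field.

    Returns the parsed address, or raises TypeError naming the field.
    """
    try:
        return ipaddress.IPv4Address(literal)
    except ipaddress.AddressValueError as err:
        raise TypeError(
            f"{field_name}: {literal!r} is not a well-formed IPv4 address"
        ) from err


address = check_ip_assertion("ip", "129.10.117.177")
```

A malformed literal such as '129.10.117' is rejected before any configuration is generated.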


Templates 


Templates allow Anomaly to instantiate many 
objects at once, using data from a configuration 
database. Figure 12 shows an example template. 





Template (
    QUERY = "SELECT hostname FROM hosts \
             WHERE os == 'solaris'";
    NAME = hostname;
) {
    Fqdn hostname;
    hostname = QUERY.hostname;
    acquire access;
}

Figure 12: Example template.





Templates consist of two parts. The first part is 
the query declaration, which appears between the 
parentheses at the top of the template declaration. It
specifies a query to be issued to the database, and 
specifies what the identifier (NAME) of each new 
object will be. In this example, one object is instanti- 
ated for each row returned by the specified SQL 
query, and the object is given the name contained in 
the ‘hostname’ column of that row. 


The second part of the template declaration is the 
object declaration. It follows all the same rules as a 
regular object declaration, except that any column of 
the query result can be referenced by the keyword
‘QUERY,’ as seen in the hostname assignment in Figure 12.


There is also a second type of template, called a
containment template, which is used to declare
containment relationships. An example of this appears at
the end of Figure 6. It follows the same rules as a
normal containment declaration, except that it can ref- 
erence columns in the query result, just like the tem- 
plate above. 
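Template instantiation can be sketched with an in-memory SQLite database. This is an illustrative assumption rather than Anomaly's mechanism: the instantiate function, the table contents, and the dictionary used to represent an object are all hypothetical.

```python
import sqlite3


def instantiate(conn, query, name_column):
    """Create one object per row of `query`, named by `name_column`.

    Each object is a plain dict of the row's columns, standing in for
    the QUERY.<column> references inside a template body.
    """
    conn.row_factory = sqlite3.Row
    objects = {}
    for row in conn.execute(query):
        objects[row[name_column]] = {key: row[key] for key in row.keys()}
    return objects


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (hostname TEXT, os TEXT)")
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?)",
    [("chorf", "solaris"), ("ambler", "solaris"), ("darkside", "linux")],
)
objects = instantiate(
    conn, "SELECT hostname FROM hosts WHERE os = 'solaris'", "hostname"
)
```

Two objects result, one per Solaris row, each carrying the column values its template body could reference.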
Figure 11: Calculating dependencies after changing access policy for a lab.

Practical System Administration with Anomaly


To see how Anomaly can fit into normal admin- 
istrative practices, we list several sub-categories of 
system management, describe them briefly, and show 
how Anomaly can fit into each area. 


Host management is the area of system manage- 
ment embodied by Cfengine. Most of the examples in 
this paper have focused on this area. Host manage- 
ment tasks lend themselves well to a context-based 
approach, so Anomaly is best suited to this area of 
system management. For Anomaly to succeed, how- 
ever, it must be able to conform itself to existing envi- 
ronments, rather than expecting environments to be 
built around its notions of containment and context. 
Fortunately, typical computing environments are rife
with contextual relationships that can be exploited by 
Anomaly. Therefore, we believe that Anomaly has the 
potential to be an appropriate addition to existing 
environments, rather than a foundation for environ- 
ments that are being rebuilt from the ground up. 


Service management includes tasks such as the 
generation of DNS zone files and DHCP configura- 
tions, to name a few. Database driven approaches lend 
themselves well to these tasks, because the configura- 
tion files being produced are often nothing more than 
a listing of the contents of the database in an obscure 
format. Anomaly is not particularly well suited to this 
area of system management, because it often requires 
data about every object in the environment to be 
brought together in a single location. A simple query 
to a configuration database is the appropriate tool for 
this job. To do the same with an acquisition-based tool 
would be awkward and inefficient. 


User management falls partially under the previ- 
ous category, because it may involve services such as 
NIS, LDAP, and Kerberos, but deserves its own cate- 
gory because it also involves resource allocation, 
specifically home directories and mail spools. User 
accounts are laden with contextual information, such 
as the account owner’s position within the organiza- 
tion. Anomaly could be used with an existing user 
management system as a means to model these con- 
textual relationships. Doing so could simplify tasks 
such as disk quota assignment and account expiration. 


Software (or package) management is the area 
embodied by systems such as Depot, Stow, and RPM. 
If software packages are installed locally on individual 
hosts, this category is partially subsumed by host man- 
agement. However, software is often installed on glob- 
ally accessible filesystems (e.g., NFS), so software 
management must be considered separately. 


Anomaly is not intended to be a software man- 
agement system by itself, but it can interface with 
existing software management tools. For example, an 
RPM extension to Anomaly would need only to wrap 
logical assertions around RPM’s query-based interface.
The has operator, in this case, would issue a
query to see if a given package was installed on a target
machine, and install it if necessary. Here we see
that Anomaly can coexist with other systems. There 
would be no reason to stop using the existing system 
and rely only on Anomaly. Anomaly would simply be 
used to model and enforce the contextual relationships 
affecting software installation (e.g., if a machine is 
one of the mail servers for an environment, it must 
have an MTA installed). 
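
The query-then-install behavior described for an RPM-backed has operator can be sketched as follows. This is a minimal illustration, not Anomaly's code: the function names, the `Runner` type, and the return values are hypothetical, and the sketch runs `rpm` on the local host, whereas Anomaly would act on a target machine. Only the `rpm -q` (query) and `rpm -i` (install) invocations are standard RPM usage.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's exit status.
# (Hypothetical abstraction so the command could also be sent to a remote host.)
Runner = Callable[[Sequence[str]], int]


def run_local(argv: Sequence[str]) -> int:
    """Run a command on the local host, discarding its output."""
    result = subprocess.run(
        list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def has_package(name: str, package_file: str, run: Runner = run_local) -> str:
    """Assert that package `name` is installed, installing it if necessary.

    Returns "present" if the query succeeded, or "installed" if the package
    had to be installed from `package_file`.
    """
    # `rpm -q NAME` exits with status 0 only when the package is installed.
    if run(["rpm", "-q", name]) == 0:
        return "present"
    # `rpm -i FILE` installs the given package file.
    if run(["rpm", "-i", package_file]) != 0:
        raise RuntimeError(f"could not install {name} from {package_file}")
    return "installed"
```

Passing a different `run` function is one way such an assertion could be directed at another machine; the existing package system still does all of the actual work.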


Another practical consideration is that of 
Anomaly’s configuration transport mechanism. In 
practice, Anomaly uses three types of configuration 
mechanisms: configuration files (e.g., /etc/passwd and
/etc/resolv.conf), configuration scripts (e.g., a script that
executes link statements), and SNMP set commands. 
The configuration files are transported to the appropri- 
ate machines using ssh and a small helper script that 
writes the file contents to the appropriate locations. 
Configuration scripts are copied to a temporary direc- 
tory and executed on the target machine. SNMP set 
commands are executed as Anomaly interprets the 
code. In this sense, Anomaly is primarily a “push” 
tool, that is, configurations represented in Anomaly 
are generated on a single host, and then distributed to 
the managed objects. 
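
The file transport step can be pictured with a short sketch. It assumes a helper script on the target that writes its standard input to the path given as its argument; the helper name `anomaly-write-file` and both function names are hypothetical, since the paper does not name them.

```python
import shlex
import subprocess
from typing import List

# Hypothetical name for the small helper script installed on each target.
HELPER = "anomaly-write-file"


def push_command(host: str, remote_path: str) -> List[str]:
    """Build the ssh command that asks the helper to write stdin to remote_path."""
    # ssh hands the remote command to a shell, so the path is quoted.
    return ["ssh", host, f"{HELPER} {shlex.quote(remote_path)}"]


def push_file(host: str, remote_path: str, contents: bytes) -> None:
    """Push generated file contents from the central host to one target."""
    subprocess.run(push_command(host, remote_path), input=contents, check=True)
```

For example, `push_file("ws1", "/etc/resolv.conf", data)` would send a generated resolver configuration to a host called `ws1`.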


A Caveat 


Acquisition is not a panacea for all system 
administration tasks. In particular, an acquisition- 
based approach does not provide any assistance with 
unique information in a network. For example, when a 
host is moved from one subnet to another, its IP 
address must change (see footnote 5), in addition to its netmask and
default route. A natural suggestion is that hosts 
acquire their IP addresses from their subnet contain- 
ers, which are responsible for ensuring the uniqueness 
of each address. 


Anomaly’s implementation of acquisition cannot 
facilitate this design, and acquisition in general does 
not lend itself to this approach. This is true for a num- 
ber of reasons, the most important one being that to 
implement such a system, containers would have to 
retain state about the values that had been handed out 
to sub-objects. Storing this state would make it impos- 
sible to cull unmodified objects by constructing a 
dependency graph. 


We do not believe, however, that the inability to 
manage unique information constitutes a genuine 
weakness. The goal of Anomaly is to make it easier to 
manage information that is common to a group of 
objects. System administrators are already good at 
managing unique information, but they do need sup- 
port to keep common information consistent. 


Footnote 5: Changing IP addresses via a remote configuration
management system is problematic for other reasons.
Nonetheless, this example illustrates a range of problems
in which unique information must be managed.


Some First Experiences


We have used Anomaly to manage a lab of Unix
workstations, and the switch ports to which they are
connected. The workstations run Solaris 2.8, and the
switch is a Cisco Catalyst 5500. Using this experimen- 
tal setup, we were able to construct real containment 
graphs that effectively modeled our test network. By 
moving machines to different positions in the contain- 
ment graph, i.e., by changing their context, we were 
able to verify how quickly an Anomaly-administered 
network can be reconfigured when the components of 
the network change. 


The experiments also revealed some weaknesses 
in Anomaly’s configuration transport mechanisms. 
First, because Anomaly is intended to work with a 
wide variety of hardware (not necessarily UNIX 
machines), restricting all objects to an rsh transport 
mechanism is not feasible. For Anomaly to be 
extended to other operating systems, we need to 
design a universal system for configuration transport. 
Second, to overcome the difficulties of performing 
changes to networking configurations (e.g., IP 
addresses, netmasks), Anomaly should produce floppy 
disks (or other removable media) to transport basic 
configuration data. We hope to address these problems 
with future research. 


Project Status and Source Code Availability 


As of this writing, Anomaly is not ready for 
widespread use. Its most pressing need is for the 
development of a large and portable suite of adminis- 
trative modules. Currently, Anomaly has modules for 
managing Solaris hosts in an NIS/NFS environment, 
and a module for managing certain aspects of Cisco 
Catalyst series switches (specifically, VLAN member- 
ship and port security). 


Some of the source code listings in this paper
show, for illustrative purposes, components of Anomaly
that are not stable as of this writing. Specifically,
database templates, the ‘packages’ field, and the
‘patches’ field are only partially implemented.


The latest source code, along with current information
about Anomaly, is available at
http://www.ccs.neu.edu/home/mlogan/anomaly/.


A Simple Way to Estimate the Cost of Downtime


David A. Patterson — University of California at Berkeley 


ABSTRACT 


Systems that are more dependable and less expensive to maintain may be more expensive to
purchase. If ordinary customers cannot calculate the costs of downtime, such systems may not
succeed because it will be difficult to justify a higher price. Hence, we propose an
easy-to-calculate estimate of the cost of downtime.


As one reviewer commented, the cost estimate we propose “is simply a symbolic translation
of the most obvious, common sense approach to the problem.” We take this remark as a
compliment, noting that prior work has ignored pieces of this obvious formula.


We introduce this formula, argue why it will be important to have a formula that can easily
be calculated, suggest why it will be hard to get a more accurate estimate, and give some
examples.


Widespread use of this obvious formula can lay a foundation for systems that reduce
downtime.


Introduction


It is time for the systems community of 
researchers and developers to broaden the agenda 
beyond performance. The 10,000X increase in perfor- 
mance over the last 20 years means that other aspects of 
computing have risen in relative importance. The sys- 
tems we have created are fast and cheap, but undepend- 
able. Since a portion of system administration is dealing 
with failures [Anderson 1999], downtime surely adds to 
the high cost of ownership. 


To understand why they are undependable, we
conducted two surveys on the causes of downtime. For
the first survey, Figure 1 shows failure data we collected
on the U.S. Public Switched Telephone Network
(PSTN) [Enriquez 2002]. It shows the percentage
of failures due to operators, hardware failures,
software failures, and overload for over 200 outages in
2000. Although not directly relevant to computing
systems, this data set is very thorough in its description
of the problems and the impact of the outages. For
the second survey, Figure 2 shows failure data we
collected from three Internet sites [Oppenheimer 2002].
The surveys are notably consistent in their suggestion
that operators are the leading cause of failure.


Collections of failure data often ignore operator 
error, as it often requires asking operators if they think 
they made an error. Studies that are careful about how 
they collect data do find results that are consistent 
with these graphs [Gray 1985, Gray 1990, Kuhn 1997, 
Murphy 1990, Murphy 1995]. 


Improving dependability and lowering cost of 
ownership are likely to require more resources. For 
example, an undo system for operator actions would 
need more disk space than conventional systems. The 
marketplace may not accept such innovations if
products that use them are more expensive and the
subsequent benefits cannot be quantified by lower cost
of ownership. Indeed, a common lament of computer
companies is that customers may complain about
dependability, but are unwilling to pay the higher price
of more dependable systems. The difficulty of measuring
the cost of downtime may be the reason for this
apparently irrational behavior.


Hence, this paper seeks to define a simple
and useful estimate of the cost of unavailability.


Estimating Revenue and Productivity 


Prior work on estimating the cost of downtime
usually measures the loss of revenue for online
companies or other services that cannot possibly function
if their computers are down [Kembel 2000]. Table 1 is
a typical example.


Table 1: Cost of one hour of downtime. From InternetWeek
4/3/2000 and based on a survey done by
Contingency Planning Research [Kembel 2000].
[Only the row labels of the table survive here; the
dollar amounts are missing.]

    Brokerage operations
    Credit card authorization
    Ebay
    Amazon.com
    Package shipping services
    Home shopping channel
    Catalog sales center
    Airline reservation center
    Cellular service activation
    On-line network fees
    ATM service fees


Such companies are not the only ones that lose
revenue if there is an outage. More importantly, such a 
table ignores the loss to a company of wasting the 
time of employees who cannot get their work done 
during an outage, even if it does not affect revenue. 
Thus, we need a formula that is easy to calculate so 
that administrators and CIOs in any institution can 
determine the costs of outage. It should capture both 
the cost of lost productivity of employees and the cost 
of lost revenue from missed sales. 


[Figure 1: pie chart. Operator 59%, Hardware 22%, Overload 11%, Software 8%.]

Figure 1: Percentage of failures by operator, hardware,
software, and overload for PSTN. The PSTN data
measured blocked calls during an outage in the year
2000. (This figure does not show vandalism, which
is responsible for 0.5% of blocked calls.) We collected
this data from the FCC; it represents over 200
telephone outages in the U.S. that affected at least
30,000 customers or lasted at least 30 minutes.
Rather than only reporting outages, telephone
switches record the number of attempted calls
blocked during an outage, which is an attractive
metric. The figure does not include environmental
causes, which are responsible for 1% of the outages.


We start with the formula, and then explain how 
we derived it: 
Estimated average cost of 1 hour of downtime = 
Empl. costs/hour * % Empl’s affected by outage + 
Avg. Rev./hour * % Rev. affected by outage 


Employee costs per hour is simply the total
salaries and benefits of all employees per week
divided by the average number of working hours per
week. Average revenue per hour is just the total revenue
of an institution per week divided by the average
number of hours per week an institution is open for
business. Note that this term includes two factors:
revenue associated with a web site and revenue supported
by the internal information technology infrastructure. 
We believe these employee costs and revenue are not 
too difficult to calculate, especially since they are 
input to an estimate, and hence do not have to be
precise to the last penny.


For example, publicly traded companies must
report their revenue and expenses every quarter, so 
quarterly statements have some of the data to calculate 
these terms. Although revenue is easy to find, 
employee costs are typically not reported separately. 
Fortunately, they report the number of employees, and 
you can estimate the cost per employee. The finance 
department of smaller companies must know both 
these terms to pay the bills and issue paychecks. Even 
departments in public universities and government 
agencies without conventional revenue sources have 
their salaries in the public record. 


The other two terms of the formula, the fraction of
employees and the fraction of revenue affected by an outage,
just need to be educated guesses or ranges that make 
sense for your institution. 
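
The formula translates directly into a few lines of code. The sketch below is only an illustration: the function name and the sample figures are hypothetical, and the two fractions are the educated guesses described above, expressed as numbers between 0 and 1.

```python
def downtime_cost_per_hour(
    employee_cost_per_hour: float,
    fraction_employees_affected: float,
    revenue_per_hour: float,
    fraction_revenue_affected: float,
) -> float:
    """Estimated average cost of one hour of downtime."""
    for fraction in (fraction_employees_affected, fraction_revenue_affected):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    lost_productivity = employee_cost_per_hour * fraction_employees_affected
    lost_revenue = revenue_per_hour * fraction_revenue_affected
    return lost_productivity + lost_revenue


# Hypothetical institution: $10,000/hour in employee costs with half the
# staff affected, $20,000/hour in revenue with a quarter of it affected.
example = downtime_cost_per_hour(10_000, 0.5, 20_000, 0.25)
print(example)
```

Calling the function with a low and a high guess for each fraction gives a range rather than a single number, which may be all that is needed to support a purchasing argument.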





[Figure 2: pie chart. Operator 51%, Software 34%, Hardware 15%, Overload 0%.]

Figure 2: Percentage of failures by operator, hardware,
software, and overload for three Internet
sites. Note that the mature software of the PSTN
is much less of a problem than Internet site software,
yet the Internet sites have such frequent
fluctuations in demand that they have overprovisioned
sufficiently so that overload failures are
rare. The Internet site data measured outages in
2001. We collected this data from companies in
return for anonymity; it represents six weeks to
six months of services with 500 to 5000 computers.
Also, 25% of outages had no identifiable
cause, and are not included in the data. One
reviewer suggested the explanation was “nobody
confessed;” that is a plausible interpretation.





Is Precision Possible? 


As two of the four terms are guesses, the estimate
is clearly not precise and is open to debate.
Although we all normally strive for precision, it may
not be possible here. The first point is that, to establish
the argument for systems that may be more expensive
to buy but less expensive to own, administrators only
need to give an example of what the costs might be
using the formula above. CIOs could decide on their
own fractions in making their decisions.


The second point is that much more effort may 
not ultimately lead to a precise answer, for there are 
some questions that will be very hard to answer. For 
example, depending on the company, one hour of 
downtime may not lead to lost revenue, as customers 
may just wait and order later. In addition, employees 
may simply do other work for an hour that does not 
involve a computer. Depending on centralization of 
services, an outage may only affect a portion of the 
employees. There may also be different systems for 
employees and for sales, so an outage might affect just 
one part or the whole company. Finally, there is surely 
variation in cost depending on when an outage occurs; for
most companies, an outage on Sunday morning at 2 AM
probably has little impact on revenue or employee
productivity.


Before giving examples, we need to qualify this
estimate. First, it ignores the cost of repair, such as the
cost of overtime by operators or bringing in consultants.
We assume these costs are small relative to the other
terms. Second, the estimate ignores daily and seasonal
variations in revenue, as some hours are more expensive
than others. For example, a company running a
lottery is likely to have a rapid increase in revenue as
the deadline approaches. Perhaps best-case and
worst-case costs of downtime might be useful as well
as the average, and it is clear how to calculate them from
the formula.


Example 1: University of California at Berkeley 
EECS Department 


The Electrical Engineering and Computer Sci- 
ence (EECS) department at U. C. Berkeley does not 
sell products, and so there is no revenue to lose in an 
outage. As we have many employees, the loss in pro- 
ductivity could be expensive. 


The employee costs have two components: those 
paid for by state funds and those paid for by external 
research funds. The state pays 68 full-time staff
collective salaries and benefits of $403,130 per month.
This is an average annual salary and benefits of about
$71,140 per person.
These figures do not include the 80 faculty, who are 
paid approximately $754,700 per month year round, 
including benefits. During the school year external 
research pays 670 full-time and part-time employees 
$1,982,500, including benefits. During the summer, 
both faculty and students can earn extra income and 
some people are not around, so the numbers change to 
635 people earning $2,915,950. Thus, the total 
monthly salaries are $3,140,330 during the year and 
$4,073,780 during the summer. 


If we assume people worked 10 hours a working 
day, the Employee costs per hour is $14,780 during 
the school year and $19,170 during the summer. If we 
assumed people worked 24 hours a day, seven days a
week, the costs would change to $4300 and $5580.


If EECS file servers and mail servers have 99%
availability, that would mean seven hours of downtime 
per month. If half of the outages affected half the 
employees, the annual cost in lost productivity would 
be $250,000 to $300,000. 


Since I was a member of EECS, it only took two 
emails to collect the data. 
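
The EECS figures can be re-derived with a short script. The monthly payroll components and the 99% availability come from the text; the number of working hours per month does not. The value of 212.5 hours used for the 10-hour working day is an assumption, chosen because it reproduces the quoted $14,780 and $19,170, and 730 hours (8760 / 12) is used for a 24x7 month.

```python
# Assumed hours per month (not stated in the paper; see the note above).
HOURS_10X5_PER_MONTH = 212.5
HOURS_24X7_PER_MONTH = 8760 / 12  # 730 hours


def hourly_cost(monthly_payroll: float, hours_per_month: float) -> float:
    """Employee costs per hour for a given monthly payroll."""
    return monthly_payroll / hours_per_month


def downtime_hours_per_month(availability: float) -> float:
    """Expected hours of downtime in a 730-hour month."""
    return (1.0 - availability) * HOURS_24X7_PER_MONTH


# Monthly salaries and benefits: state-funded staff + faculty + research-funded.
school_year = 403_130 + 754_700 + 1_982_500
summer = 403_130 + 754_700 + 2_915_950

for label, payroll in (("school year", school_year), ("summer", summer)):
    print(
        label,
        payroll,
        round(hourly_cost(payroll, HOURS_10X5_PER_MONTH)),
        round(hourly_cost(payroll, HOURS_24X7_PER_MONTH)),
    )

print(downtime_hours_per_month(0.99))
```

The two payroll totals match the $3,140,330 and $4,073,780 given above, and 99% availability works out to a little over seven hours of downtime per month.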


Example 2: Amazon 


Amazon has revenue as well as employee costs,
and virtually all of its revenue comes over the net.
Last year its revenue was $3.1B and it had 7744
employees. This data is available at
many places; I got it from Charles Schwab. Alas, 
annual reports do not break down the cost of employ- 
ees, just the revenue per employee. That was 
$400,310. Let’s assume that Amazon employee 
salaries and benefits are about 20% higher than Uni- 
versity employees at, say, $85,000 per year. Then 
Employee costs per hour working 10 hours per day, 
five days per week would be $258,100 and $75,100 
for 24x7. Revenue per hour is $353,900 for 24x7, 
which seems the right measure for an Internet company.
Thus, an outage during the workweek that
affected 90% of employees and 90% of revenue streams
could cost Amazon about $550,000 per hour.


We note that employee costs are a significant 
fraction of revenue, even for an Internet company like 
Amazon. One reason is the Internet allows revenue to 
arrive 24x7 while employees work closer to more tra- 
ditional workweeks. 


Example 3: Sun Microsystems 


Sun Microsystems probably gets little of its 
income directly over the Internet, since it has an 
extensive sales force. That sales force, however, relies 
extensively on email to record interactions with cus- 
tomers. The collapse of the World Trade Center 
destroyed several mail servers in the New York area 
and many sales records were lost. Although it is not
likely that much revenue would be lost if the Sun site
went down, it could certainly affect productivity if
email were unavailable, as the company’s nervous
system appears to be email.


In the last year, Sun’s revenue was $12.5B and it 
had 43,314 employees. Let’s assume that the average 
salary and benefits are $100,000 per employee, as Sun 
has a much larger fraction of its workforce in engineering
than does Amazon. Then employee costs per hour
working 10 hours per day, five days per week would
be $1,698,600 and $494,500 for 24x7. Since Sun is a
global company, perhaps 24x5 is the right model. That
would make the costs $694,100 per hour. Revenue per 
hour is $1,426,900 for 24x7 and $2,003,200 for 24x5,
with the latter the likely best choice.


Let’s assume that a workweek outage affected
10% of revenue that hour and 90% of the employees.
The cost per hour would be about $825,000, with three-
fourths of the cost being employees. 
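
The Amazon and Sun estimates can be reproduced in the same way. The revenue, headcount, assumed salaries, and affected fractions below are the ones given in the two examples; the 2550 hours per year for a 10-hour, five-day schedule is an assumption (12 months of 212.5 hours) that matches the quoted per-hour figures, and the variable names are illustrative only.

```python
HOURS_PER_YEAR = {
    "10x5": 2550,         # assumption: 12 months of 212.5 working hours
    "24x5": 24 * 5 * 52,  # 6240 hours
    "24x7": 24 * 365,     # 8760 hours
}


def per_hour(annual_amount: float, schedule: str) -> float:
    """Spread an annual amount evenly over the hours of a schedule."""
    return annual_amount / HOURS_PER_YEAR[schedule]


# Amazon: employees on a 10x5 schedule, revenue arriving 24x7,
# 90% of employees and 90% of revenue affected.
amazon_employees = per_hour(7744 * 85_000, "10x5")
amazon_revenue = per_hour(3.1e9, "24x7")
amazon = 0.9 * amazon_employees + 0.9 * amazon_revenue

# Sun: both terms on a 24x5 schedule,
# 90% of employees and 10% of revenue affected.
sun_employees = per_hour(43_314 * 100_000, "24x5")
sun_revenue = per_hour(12.5e9, "24x5")
sun_employee_part = 0.9 * sun_employees
sun = sun_employee_part + 0.1 * sun_revenue
sun_employee_share = sun_employee_part / sun

print(round(amazon), round(sun), round(sun_employee_share, 2))
```

Changing the schedule key for either term shows how strongly the estimate depends on the choice of working-hours model, which is why the text discusses 10x5, 24x5, and 24x7 separately.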


Conclusion 


The goal of this paper is to provide an easy-to-calculate
estimate of the average cost of downtime so as to
justify systems that may be slightly more expensive to
purchase but potentially spend much less time unavailable.
It is important for administrators and CIOs to have
an easy-to-use estimate that sets the range of outage
costs, so that they are more likely to take those costs
into consideration when setting policies and acquiring
systems. If it were hard to calculate, few people would do it.


Although this is a simple estimate, we argue that a
much more time-consuming calculation may not yield
much more insight, as it is very hard to know how
many consumers will simply reorder when the system
comes back rather than go to a competitor, or how many
employees will do something else productive while
the computer is down.


We see that employee costs, traditionally ignored 
in such estimates, are significant even for Internet 
companies like Amazon, and dominate the costs of 
more traditional organizations like Sun Microsystems. 
Outages at universities and government organizations 
can still be expensive, even without a loss of a signifi- 
cant computer-related revenue stream. 


In addition to this estimate, there may be indirect
costs to outages that can be as important to the
company as these more immediate costs. Outages can
lead to management overhead as the IT department is
blamed for every possible problem and delay throughout
the company. Company morale can suffer, reducing
everyone’s productivity for periods that far exceed the
outage time. Frequent outages can lead to a loss of
confidence in the IT team and its skills. Such a change in
stature could eventually lead to individual departments
hiring their own IT people, which leads to direct costs.


As many researchers are working on solutions
to these dependability problems [IBM 2000, Patterson
2002], our hope is that these simple estimates can
help organizations justify systems that are more
dependable, even if a bit more expensive.


ABSTRACT


Fueled by the growing acceptance of the Web Services Architecture, an emerging trend in 
application service delivery is to move away from tightly coupled systems towards structures of 
loosely coupled, dynamically bound systems to support both long and short business relationships. 
It appears highly likely that the next generation of e-Business systems will consist of an 
interconnection of services, each provided by a possibly different service provider, that are put 
together on an “on demand” basis to offer an end-to-end service to a customer.


Such an environment, which we call Dynamic e-Business (DeB), will be administered and 
managed according to dynamically negotiated Service Level Agreements (SLA) between service 
providers and customers. Consequently, system administration will increasingly become SLA- 
driven and needs to address challenges such as dynamically determining whether enough spare 
capacity is available to accommodate additional SLAs, the negotiation of SLA terms and 
conditions, the continuous monitoring of a multitude of agreed-upon SLA parameters and the 
troubleshooting of systems, based on their importance for achieving business objectives. 


A key prerequisite for meeting these goals is to understand the relationship between the cost 
of the systems an administrator is responsible for and the revenue they are able to generate, i.e., a 
model needs to be in place to express system resources in financial terms. Today, this is usually 
not the case. 


In order to address some of these problems, this paper presents the Web Service Level 
Agreement (WSLA) framework for defining and monitoring SLAs in inter-domain environments. 
The framework consists of a flexible and extensible language based on the XML schema and a 
runtime architecture based on several SLA monitoring services, which may be outsourced to third 
parties to ensure a maximum of accuracy. 


WSLA enables service customers and providers to unambiguously define a wide variety of
SLAs, specify the SLA parameters and the way they are measured, and tie them to managed
resource instrumentations. A Java-based implementation of this framework, termed SLA
Compliance Monitor, is publicly available as part of the IBM Web Services Toolkit.


Introduction and Motivation 


The pervasiveness of the Internet provides a plat- 
form for businesses to offer and buy electronic ser- 
vices, such as financial information, hosted services, 
or even applications, that can be integrated in a cus- 
tomer’s application architecture. Upcoming standards 
for the description and advertisement of, as well as the 
interaction with, online services promise that organi- 
zations can integrate their systems in a seamless man- 
ner. The Web Services framework [16] provides such 
an integration platform, based on the WSDL service 
interface description language, the UDDI directory 
service [31] and, for example, SOAP over HTTP as a 
communication mechanism. Web Services provide the 
opportunity to dynamically bind to services at run- 
time, i.e., to enter (and dismiss) a business relationship 
with a service provider on a case-by-case basis, thus 
creating an infrastructure for dynamic e-Business [14]. 


Dynamic e-Business implies dynamics several 
orders of magnitude higher than found in traditional
corporate networks. Moreover, a service relationship
also constitutes a business relationship between inde- 
pendent organizations, defined in a contract. 


An important aspect of a contract for IT services 
is the set of Quality of Service (QoS) guarantees a ser- 
vice provider gives. This is commonly referred to as a 
service level agreement (SLA) [32, 17]. Today, SLAs 
between organizations are used in all areas of IT ser- 
vices — in many cases for hosting and communication 
services but also for help desks and problem resolution. 


Furthermore, the IT parameters for which Service 
Level Objectives (SLO) are defined come from a vari- 
ety of disciplines, such as business process manage- 
ment, service and application management, and tradi- 
tional systems and network management. In addition, 
different organizations have different definitions for 
crucial IT parameters such as Availability, Throughput, 
Downtime, Bandwidth, Response Time, etc. Today’s
SLAs are plain natural language documents. Consequently,
they must be manually provisioned and monitored,
which is very expensive and slow.


The definition, negotiation, deployment, monitoring
and enforcement of SLAs must become, in
contrast to today’s state of the art, an automated process.
This poses several challenges for the administration
of shared distributed systems, as found in Internet
Data Centers, because administrative tasks become
increasingly dynamic and SLA-driven.


The objective of this paper is to present the Web 
Service Level Agreement (WSLA) framework as an 
approach to deal with these problems; it provides a 
flexible, formal language and a set of elementary ser- 
vices for defining and monitoring SLAs in dynamic e- 
Business environments. 


The paper is structured as follows: In the next 
section, we describe the underlying principles of our 
work, analyze the requirements of dynamic e-Business 
on system administration tasks and on the WSLA 
framework. We also describe the relationships of our 
work to the existing state of the art. The WSLA run- 
time architecture, described later, provides mecha- 
nisms for accessing resource metrics of managed sys- 
tems and for defining, monitoring and evaluating SLA 
parameters according to an SLA specification. We 
subsequently introduce the WSLA language by means 
of several examples. It is based on the XML Schema 
and allows parties to define QoS guarantees for elec- 
tronic services and the processes for monitoring them. 
Finally, the last section concludes the paper and gives 
an overview of our current work. 


Principles of the WSLA Framework 


Service level management has been the subject 
of intense research for several years now and has 
reached a certain degree of maturity. However, despite 
initial work in the field (see, e.g., [2]), the problem of 
establishing a generic framework for service level 
management in cross-organizational environments 
remains unsolved. In this section, we introduce the
terminology and describe the fundamental principles, 
which will be used throughout this paper. Subse- 
quently, focusing on SLA-driven system administra- 
tion, we derive the requirements of the WSLA lan- 
guage and its runtime architecture. 


Terminology 


Service customers differ in the extent to which they
are willing to accept the parameters
offered by the service provider. Metric- and SLA-related
information appears at various tiers of a distributed
system, as depicted in Figure 1.

• Resource Metrics are retrieved directly from
the managed resources residing in the service
provider’s tier, such as routers, servers, middleware
and instrumented applications. Typical
examples are the well-known MIB variables of
the IETF Structure of Management Information
(SMI) [21], such as counters and gauges.
• Composite Metrics are created by aggregating
several resource (or other composite) metrics
according to a specific algorithm, such as averaging
one or more metrics over a specific
amount of time or by breaking them down
according to specific criteria (e.g., top 5%,
minimum, maximum etc.). This is usually
done within the service provider’s domain but
can be outsourced to a third-party measurement
service as well. Composite metrics are exposed
by a service provider by means of a well-defined
(usually HTTP or SOAP based) interface
for further processing.
• SLA Parameters put the metrics available
from a service provider into the context of a
specific customer and are therefore the core
part of an SLA. In contrast to the previous metrics,
every SLA parameter is associated with
high/low watermarks, which enable the customer,
provider, or a designated third party to
evaluate whether the retrieved metrics meet,
exceed, or fall below the defined service level
objectives. Consequently, every SLA parameter
and its permitted range are defined in the SLA.
It makes sense to delegate the evaluation of
SLA parameters against the SLOs to an independent
third party; this ensures that the evaluation
is objective and accurate.
• Business Metrics relate SLA parameters to
financial terms specific to a service customer
(and thus are usually kept confidential by him).
They form the basis of a customer’s risk management
strategy and exist only within the service
customer’s domain. It should be noted that
a service provider needs to perform a similar
mapping to make sure the SLAs he is willing to
satisfy are in accordance with his business goals.


[Figure 1 labels: Business Metrics, SLA Parameters, Composite Metrics,
Resource Metrics; Measurement Function; Provider-defined; Customer-defined.]

Figure 1: Aggregating business metrics, SLA parameters
and metrics across different organizations.


The WSLA framework presented in this paper is 
capable of handling all four different parameter types; 
apart from business metrics, they relate directly to systems
management tasks and are our main focus. However, 
the flexible mechanism for composing SLAs can be 
easily extended to accommodate business metrics. 
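
The relationship between the first three tiers can be illustrated with a small sketch: resource metric samples are aggregated into a composite metric, which an SLA parameter then checks against its watermarks. This is not WSLA syntax (the WSLA language is XML based); the class, field, and function names are hypothetical and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


def average(samples: Sequence[float]) -> float:
    """Composite metric: average of resource metric samples over a window."""
    return mean(samples)


@dataclass
class SLAParameter:
    """A metric placed in a customer's context, with low/high watermarks."""

    name: str
    low_watermark: Optional[float] = None
    high_watermark: Optional[float] = None

    def evaluate(self, value: float) -> str:
        """Report whether the value falls below, within, or above the range."""
        if self.low_watermark is not None and value < self.low_watermark:
            return "below"
        if self.high_watermark is not None and value > self.high_watermark:
            return "above"
        return "within"


# Invented response-time samples (milliseconds) from an instrumented server.
response_time = SLAParameter("AverageResponseTime", high_watermark=200.0)
good_window = average([120, 180, 150])
bad_window = average([250, 300, 350])
print(response_time.evaluate(good_window), response_time.evaluate(bad_window))
```

In this picture the aggregation could run at the provider or at a third-party measurement service, while the evaluation against the watermarks is the step that the text suggests delegating to an independent party.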


Scenarios for SLA Establishment 


Often, it is not obvious where to draw the line between
the aforementioned parameter types, in particular
between Composite Metrics and SLA Parameters.
Therefore, we assume that every parameter related to a
customer and associated with a guaranteed value range
is considered an SLA parameter, which is supposed to
be part of an SLA. However, this distinction is also
highly dependent on the extent to which a customer
requires the customization of metrics exposed by the
service provider (or a third-party measurement service),
and on how much he is willing to pay for it. This, in turn,
depends on the degree of customization the provider is
willing to apply to its metrics. The following scenarios
describe how SLAs may be defined:

1. A customer adopts the data exposed by a 
service provider without further refinement. 
This is often done when the metrics reflect 
good common practice, cannot be modified by 
the customer, or are of lesser importance to 
him. In this case, the selected metrics become 
the SLA parameters and thus integral parts of 
the SLA. Examples are the length of maintenance 
intervals or the backup frequency. 

2. The customer requests that collected data be 
put into a meaningful context. A customer is 
probably not interested in the overall availability 
of a provider’s data center, but needs to 
know the availability of the specific cluster 
within the data center on which his applications 
and data are hosted. A provider’s data collection 
algorithm therefore needs, at a minimum, to 
take into account for which customer the data is 
actually collected. A provider may decide to 
offer such preprocessed data, for example the 
availability of the server cluster hosting customer 
X’s web application. 

3. The customer requests customized data that 
is collected according to his specific requirements. 
While a solution to item 2 can still be 
reasonably static (changes tend to happen rarely 
and the nature of the modifiable parameters can 
be anticipated reasonably well), the degree of 
choice for the customer can be taken a step 
further by allowing him to specify arbitrary 
parameters, e.g., the input parameters of a data 
collection algorithm. 

This implies that a service provider needs to have 
a mechanism in place that allows a customer to 
provide these input parameters, preferably at 
runtime, e.g., the average load of a server 
hosting the customer’s website should be sampled 
every 30 seconds and collected over 24 hours. 
Note that a change of these parameters may 
result in a change of the terms and conditions of 
an SLA, e.g., when a customer chooses sampling 
intervals that are likely to impact the performance 
of the monitored system; eventually, this 
may entail the violation of SLAs the service 
provider has with other customers. 

4. The customer specifies how data is collected. 
This means that he defines, in addition to the 
metrics and input parameters, the data collection 
algorithm itself. This is obviously the most 
extreme case and seems fairly unlikely. However, 
large customers may insist on getting 
access to very specific data that is not part of 
the standard set, e.g., a customer may want to 
know which employees of a service provider 
had physical access to the systems hosting his 
data and would like to receive a daily log of the 
badge reader. 

This means that, in addition to the aforementioned 
extension mechanisms, a service 
provider needs to have a mechanism in place 
that allows him to introduce new data collection 
mechanisms without interrupting his management 
and production systems. 
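
Scenario 3 above, where the customer supplies the input parameters of a data collection algorithm, can be sketched as follows. The type `CollectionRequest`, the function `is_acceptable`, and the minimum-interval threshold are assumptions introduced for illustration; the paper does not prescribe such an interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionRequest:
    """Customer-supplied inputs to a data collection algorithm (hypothetical)."""
    metric: str
    sampling_interval_s: int
    window_s: int

    def samples_per_window(self) -> int:
        # Number of samples gathered over one collection window.
        return self.window_s // self.sampling_interval_s


def is_acceptable(req: CollectionRequest, min_interval_s: int) -> bool:
    """Reject intervals so short that probing may disturb the monitored system."""
    return req.sampling_interval_s >= min_interval_s


# The example from the text: sample every 30 seconds, collect over 24 hours.
req = CollectionRequest("avg_server_load", 30, 24 * 3600)
print(req.samples_per_window())
```

A provider-side check such as `is_acceptable` is one place where a change of parameters could trigger a renegotiation of the SLA terms rather than a silent reconfiguration.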


While the last case poses the greatest challenge 
to the programmability of the monitoring system, a 
service provider benefits greatly from a management 
system capable of handling such flexible SLAs, 
because all the former situations are special cases of 
the latter. Such a system also addresses the extreme 
variability of today’s SLAs. The sample SLAs we 
analyzed clearly indicate the need for a mechanism 
that allows the data collection algorithm to be 
specified unambiguously. Also, it should be noted that the different 
possibilities of specifying service level objectives are 
not mutually exclusive and may all be specified within 
the same SLA. 


SLA-driven System Administration 


Now that we have introduced the concepts of 
SLA management in a dynamic e-Business environment, 
we are able to derive its implications for systems 
administration and management. While the very high 
dynamics of the establishment/dismissal of business 
relationships, and the resulting allocation/deallocation 
of system resources to different users, are clearly a 
challenge in their own right, we have found several 
other issues that are likely to impact how system 
administration is done in such an environment. The 
way we see the tasks of a system administrator evolve 
is described in the following subsections. 


Express System Resources in Financial Terms 


While system administrators usually have an 
awareness of the costs of the systems they are admin- 
istering, the need to assign prices to the various 
resources on a very fine-grained basis will certainly 
increase. For quite some time, it has been common 
practice in well-run multi-customer data centers to 
account for CPU time, memory usage and disk space 
usage on a per-user basis. What will become increas- 
ingly important in SLA-driven system administration 
is the monitoring, accounting and billing of aggre- 
gated QoS parameters such as response time, through- 
put and bandwidth, which need to be collected across 
a variety of different systems that are involved in a 
multi-tiered server environment. 




Having such a fine-grained accounting scheme in 
place is the prerequisite for defining SLOs, together 
with associated penalties or bonuses. In addition, the 
business impact of an outage or delay on the customer 
needs to be assessed. While the latter is mainly rele- 
vant to a service customer, a system administrator on 
the service provider side will need an even better 
understanding of the cost/benefit model behind the 
services offered to a customer. As a sidenote, the abil- 
ity to offer measurement facilities for fine-grained ser- 
vice parameters is likely to become a distinguishing 
factor among service providers. 


Involvement in SLA Negotiation 


The technical expertise of a system administrator 
is likely to play an increasing role in an area that is 
currently confined to business managers and lawyers: the 
negotiation of SLA terms. While current SLAs are 
dominated by legal terms and conditions, it will become 
necessary in an environment where resources are shared 
among different customers (under a variety of SLAs) to 
evaluate whether enough spare capacity is available to 
accommodate an additional SLA that asks for a specific 
amount of resources without running into the risk that 
the resources become overallocated if a customer’s 
demand increases. While complex resource allocation 
schemes will probably not be deployed in the near 
future, an administrator nevertheless needs to have an 
understanding of the safety margins he must take into 
account when accepting new customers. 
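
The capacity question raised above can be illustrated with a simple admission check. This is a minimal sketch assuming that capacity and commitments can be expressed in a single abstract unit; the function name `can_accept` and all numbers are hypothetical.

```python
def can_accept(capacity: float, committed: list[float],
               requested: float, safety_margin: float) -> bool:
    """Return True if the new SLA fits while keeping a fraction of capacity spare.

    safety_margin is the fraction of total capacity held back for demand spikes.
    """
    usable = capacity * (1.0 - safety_margin)
    return sum(committed) + requested <= usable


# Illustrative numbers: 100 units of capacity, 20% held in reserve.
print(can_accept(100.0, [30.0, 25.0], 20.0, 0.2))  # 75 units against 80 usable
print(can_accept(100.0, [30.0, 25.0], 30.0, 0.2))  # 85 units against 80 usable
```

Real resource allocation is multi-dimensional, so a single scalar is a deliberate simplification; the point is only that the safety margin enters the decision explicitly.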


A related problem is to evaluate whether addi- 
tional load due to SLA measurements is acceptable or 
not: While it may well be the case that enough capac- 
ity is available to accommodate the workload resulting 
from the service usage, overly aggressive SLA mea- 
surement algorithms may have a detrimental impact 
on the overall workload a system can handle. An 
extreme example of this is a customer whose application 
resides on a shared server and who would like to 
have the availability of the system probed every 
few seconds. In this case, an SLA may either need to 
be rejected due to the additional workload, or the price 
for carrying out the measurements will need to be 
adjusted accordingly. 


Classify Customers According to Revenue 


The previous discussions make it clear that a ser- 
vice provider’s approach to SLA-driven management 
entails the definition of enterprise policies that classify 
customers, e.g., according to the profit margins or 
their degree of contribution to a service provider’s 
overall revenue stream. The involvement of system 
administrators in the process of policy definition and 
enforcement is a consequence of having both a high 
degree of technical understanding and insight into the 
business: First, this expertise is needed to determine 
which policies are reasonable and enforceable. 


Second, once the policies are defined, it is up to 
the administrator to enforce them. For example, if 
resource capacity becomes insufficient because of the 
increased workload of a high-paying customer, 
lower-paying customers may be starved out if the 
penalties associated with their SLAs can be offset by 
the increased gains from providing additional capacity 
to the higher-paying customer. Third, it should be 
noted that such behavior adds an interesting twist to 
the problem determination schemes an administrator 
uses: the non-functioning of a customer’s system may 
not necessarily be due to a technical failure, but may 
well be the consequence of a business decision. 


Fix Outages According to Classification 


The establishment of policies and the classification 
of customers also have implications for how system 
outages are addressed. Traditionally, system 
administrators are trained to address the most severe 
outages first. This may change if a customer 
classification scheme is in place, because then the 
system whose downtime or decreased level of service 
is the most expensive for the service provider will 
need to be fixed first. Outages are likely to be 
classified not according to their technical severity, but 
rather based on their business impact. 
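
The difference between the two orderings can be made concrete with a small sketch. The `Outage` type, the severity scale, and the penalty figures are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    system: str
    technical_severity: int   # higher means more severe
    penalty_per_hour: float   # cost to the provider while the outage lasts


def by_severity(outages):
    """Traditional ordering: most severe technical problem first."""
    return sorted(outages, key=lambda o: o.technical_severity, reverse=True)


def by_business_impact(outages):
    """SLA-driven ordering: most expensive outage first."""
    return sorted(outages, key=lambda o: o.penalty_per_hour, reverse=True)


outages = [
    Outage("batch-reporting", 5, 50.0),
    Outage("storefront", 2, 4000.0),
]
print([o.system for o in by_severity(outages)])
print([o.system for o in by_business_impact(outages)])
```

With these made-up numbers the two orderings disagree: the technically minor storefront problem moves to the front once penalties are taken into account.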


Lessons Learned From Real-Life SLAs 


A suitable SLA framework for Web Services 
must not constrain the parties in the way they formu- 
late their clauses but instead allow for a high degree of 
flexibility. A management tool that implements only a 
non-modifiable textbook definition of availability 
would not be considered helpful by today’s service 
providers and their customers. 


Our studies of close to three dozen SLAs cur- 
rently used throughout the industry in the areas of 
application service provisioning (ASP) [1], web host- 
ing and information technology (IT) outsourcing have 
revealed that even if seemingly identical SLA parame- 
ters are being defined, their semantics vary greatly. 


While some service providers confine their 
definition of “application availability” to the network 
level of the hosting system (“user(s) being able to 
establish a TCP connection to the appropriate 
server”), others refer to the application that 
implements the service (“Customer’s ability to access the 
software application on the server”). Still others rely 
on the results obtained from monitoring tools (“the 
application is accessible if the server is responding to 
HTTP requests issued by a specific monitoring 
software”), while another approach uses elaborate 
formulas consisting of various metrics, which are sampled 
over fixed time intervals. 


These base clauses are then usually annotated 
with exceptions, such as maintenance intervals, 
weekend/holiday schedules, or even the business impact of 
an outage (“An outage has been detected by the ASP 
but no material, detrimental impact on the customer 
has occurred as a result”). The latter example, in 
particular, illustrates the disconnect between the people 
involved in the negotiation and establishment of an 
SLA (usually business managers and lawyers) and the 
ones who are supposed to enforce it (system 
administrators). One way of closing this gap is to enable 
system administrators to become involved in the 
negotiation of an SLA by providing them with a tool able to 
create a legal document, namely the SLA. 


It is important to keep in mind that, while the 
nature of the clauses may differ considerably among 
different SLAs, the general structure of all the different 
SLAs remains the same: Every analyzed SLA contains 

• the involved parties, 

• the SLA parameters, 

• the metrics used as input to compute the SLA 
parameters, 

• the algorithms for computing the SLA parameters, 

• the service level objectives and the appropriate 
actions to be taken if a violation of these SLOs 
has been detected. 


This implies that there is a way to come up with 
an SLA language that can be applied to a multitude of 
bilateral customer/provider relationships. 
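
The common structure listed above maps naturally onto a small data model. The sketch below is not the WSLA language itself; the class and field names are hypothetical, and the availability formula and 99% threshold are illustrative assumptions. The party names are borrowed from the paper's Figure 3.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Metric:
    name: str
    source: str                      # where the raw value is read from


@dataclass
class Parameter:
    name: str
    inputs: list[str]                # names of the metrics used as input
    compute: Callable[[dict[str, float]], float]   # the computing algorithm


@dataclass
class Objective:
    parameter: str
    holds: Callable[[float], bool]   # the service level objective
    action_on_violation: str


@dataclass
class SLA:
    parties: list[str]
    metrics: list[Metric] = field(default_factory=list)
    parameters: list[Parameter] = field(default_factory=list)
    objectives: list[Objective] = field(default_factory=list)


sla = SLA(
    parties=["ACMEProvider", "XInc"],
    metrics=[Metric("uptime_s", "probe"), Metric("period_s", "clock")],
    parameters=[Parameter("availability", ["uptime_s", "period_s"],
                          lambda m: m["uptime_s"] / m["period_s"])],
    objectives=[Objective("availability", lambda v: v >= 0.99, "notify customer")],
)
value = sla.parameters[0].compute({"uptime_s": 86000.0, "period_s": 86400.0})
print(round(value, 4), sla.objectives[0].holds(value))
```

Keeping the algorithm as a separate, explicit element is what lets two SLAs use the same parameter name ("availability") with different semantics.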


WSLA Design Goals 


In this section, we will derive — based on the 
above discussions — the requirements the WSLA 
framework needs to address. 


Ability to Accommodate a Wide Variety of SLAs 


In the introduction of this paper, we have 
stressed the point that SLAs, their parameters and the 
SLOs defined for them are extremely diverse. One 
approach to deal with this problem (e.g., as it is done 
today for simple consumer Web hosting services) is to 
narrow down the “universe of discourse” to a few 
well-understood terms and to limit the possibilities of 
choosing arbitrary QoS parameters through the use of 
SLA templates [24]. SLA templates include several 
automatically processed fields in an SLA that is 
otherwise written in natural language. However, the flexibility of this 
approach is limited and only suitable for a small set of 
variants of the same type of service using the same 
QoS parameters and a service offering that is not 
likely to undergo changes over time. In situations 
where service providers must address different SLA 
requirements of their customers, they need a more 
flexible formal language to express service level 
agreements and a runtime architecture comprising a 
set of services able to interpret this language. 


Leverage Work in the B2B Area for SLA Negotiation 
and Creation 


Architectural components and language elements 
related to SLA negotiation, creation and deployment 
should leverage existing concepts developed in the 
electronic commerce and B2B area. In particular, 
automated negotiation mechanisms, such as those 
currently being developed within the scope of 
the OASIS/ebXML [6] Collaboration Profiles and 
Agreements initiative [7], should be applicable to the 
negotiation of SLAs as well. A vast amount of work 
on electronic contracts [25, 22], contract languages 
[12] and contract negotiation has been carried out in 
the electronic commerce and B2B arena [4]. We later 
describe our usage of obligations, a concept widely 
used in e-commerce, for monitoring SLAs. 


Apply the “Need to Know” Principle to SLA Deployment 


For each service provider and customer relation- 
ship, several instances of a service may exist. The 
functionality of computing SLA parameters or evalu- 
ating contract obligations may be split, e.g., among 
multiple measurement or SLO evaluation services, 
each provided by a different organization. It is there- 
fore important that every service instance receives 
only the part of the contract it needs to know to carry 
out its task. Since it may be possible that a contractual 
party delegates the same task (such as measurements) 
to several different third party services (in order to be 
able to cross-check their results), different service 
instances may not be aware of other instances. This 
implies that every party involved in the SLA monitor- 
ing process receives only the part of the SLA that is 
relevant for him. We present our approach for dealing 
with this problem later. 


The importance of this “need to know” principle is 
further underlined by the privacy concerns of the 
various parties involved in an inter-domain management 
scenario: a service provider is, in general, interested 
in disclosing neither which of his business processes 
have been outsourced to other providers nor the 
names of these providers. On the other hand, customers 
of a dynamic e-Business will not necessarily see a need 
to know the exact reason for performance degradations 
as long as a service provider is able to take 
appropriate remedies (or compensate its customers for 
the incurred service level violation). 


End-to-end performance management has been the 
goal of traditional enterprise management efforts 
and is often explicitly listed as a 
requirement (see, e.g., [26]). However, the 
aforementioned privacy concerns of service providers and the 
service customers’ need for transparency make an 
end-to-end view unachievable (and irrelevant!) 
in a dynamic e-Business environment spanning 
multiple organizational domains. 


Delegate Monitoring Tasks to Third Parties 


Traditionally, an SLA is a bilateral agreement 
between a service customer and a service provider. 
The enhanced Telecom Operations Map (eTOM) [29], 
for example, defines various roles service providers 
can play; however, this work does not provide for the 
delegation of management functionality to further service 
providers. We refer to the parties that establish and 
sign the SLA as signatory parties. In addition, SLA 
monitoring may require the involvement of third 
parties: they come into play when a function needs 
to be carried out that neither service provider nor 
customer wants to perform, or when one signatory party 
does not trust the other to perform a function correctly. 
Third parties then act in a supporting role and are 
sponsored by one or both signatory parties. 


The targeted environment of our work is a typi- 
cal service provider environment (Internet storefronts, 
B2B marketplaces, web hosting, ASP), which consists 
of multiple, independent parties that collaborate 
according to the terms and conditions specified in the 
SLA. Consequently, the services of our architecture 
are supposed to be distributed among the various par- 
ties and need to interact across organizational 
domains. Despite the focus on cross-organizational 
entities, WSLA can be applied to environments in 
which several (or even all of the) services reside 
within the boundaries of a single organizational 
domain, such as in a traditional corporate network. 


The work of the IST Project FORM [8] is highly 
relevant for our work, since it focuses on SLAs in an 
inter-domain environment. FORM also deals with the 
important issue of federated accounting [3], which we 
do not address in this paper. An approach for a generic 
service model suitable for customer service manage- 
ment is presented in [9]. 


SLA-driven Resource Configuration 


Since the terms and conditions of an SLA may 
entail setting configuration parameters on a potentially 
wide range of managed resources, an SLA manage- 
ment framework must accommodate the definition of 
SLAs that go beyond electronic/web services and 
relate to the supporting infrastructure. On the one 
hand, it needs to tie the SLA to the monitoring param- 
eters exposed by the managed resources so that an 
SLA monitoring infrastructure is able to retrieve 
important metrics from the resources. [33] defines a 
MIB for SLA performance monitoring in an SNMP 
environment, whereas the SLA handbook from Tele- 
Management Forum [27] proposes guidelines for 
defining SLAs for telecom service providers. 


An approach for the performance instrumentation 
of EJB-based distributed applications is described 
in [5]. The capability of mapping resource metrics to 
SLA parameters is crucial because a service provider 
must be able to answer the following questions before 
signing an SLA: 

• Is it possible to accept an SLA for a specific 
service class given the fact that the capacity is 
limited? 

• Can additional workload be accommodated? 

On the other hand, it is desirable to derive 
configuration settings directly from SLAs. However, the 
heterogeneity and complexity of the management 
infrastructure make configuration management a challenge. 
Successful work in this area often focuses on the network 
level: [10] describes a network configuration language; 
the Policy Core Information Model (PCIM) of the IETF 
[23] provides a generic framework for defining policies 
to facilitate configuration management. 


Existing work in the e-commerce area may be 
applied here as well since the concept of contract-driven 
configuration in e-commerce environments [11] and vir- 
tual enterprises [20, 13] has similarities to the SLA- 
driven configuration of managed resources. 


WSLA Runtime Architecture 


In this section, we describe the WSLA runtime 
architecture by breaking it down into its atomic building 
blocks, namely the elementary services needed to enable 
the management of an SLA throughout the phases of its 
lifecycle. The first part describes the information flows 
and interactions between the different services. The next 
section demonstrates how the SLA management ser- 
vices identified earlier cooperate in an inter-domain 
environment, where the task of SLA management itself 
is dynamically delegated to an arbitrary number of man- 
agement service providers. 


WSLA Services and their Interactions 


The components described in this section are 
designed to address the “need to know” principle and 
constitute the atomic building blocks of the WSLA 
monitoring framework. The components are intended to 
interact across multiple domains; however, it is possible 
that some components may be co-located within a single 
domain and not necessarily exposed to objects residing 
within another domain. 

Figure 2: Interactions between the WSLA services. 


Figure 2 gives an overview of the SLA manage- 
ment lifecycle, which consists of five distinct phases. We 
assume that an SLA is defined for a web service, which 
is running in the servlet engine of a web application 
server. The web application server exposes a variety of 
management information either through the graphical 
user interface of an administration console or at its moni- 
toring and management interfaces, which are accessed 
by the various services of the WSLA framework. 


The interface of the web service is defined by an 
XML document in the Web Services Description 
Language (WSDL). The SLA references this WSDL 
document and extends the service definition with SLA 
management information. Typically, an SLA defines 
several SLA parameters, each referring to an operation 
of the web service. However, an SLA may also refer- 
ence the service as a whole, or even compositions of 
multiple web services [30]. The phases and the ser- 
vices that implement the functionality needed during 
the various phases are as follows. 


Phase 1: SLA Negotiation and Establishment 


The SLA is negotiated and signed by both 
signatory parties. This is done by means of an SLA 
Establishment Service, i.e., an SLA authoring tool 
that lets both signatory parties establish, price and sign 
an SLA for a given service offering. This tool allows a 
customer to retrieve the metrics offered by a service 
provider, aggregate and combine them into various 
SLA parameters, request approval from both parties, 
define secondary parties and their tasks, and make the 
SLA document available for deployment to the 
involved parties (dotted arrows in Figure 2). 


Phase 2: SLA Deployment 


Deployment Service: The deployment service is 
responsible for checking the validity of the SLA and 
distributing it either in full or in appropriate parts to 
the involved components (dashed arrows in Figure 2). 
Since two signatory parties negotiate the SLA, they 
must inform the supporting parties about their respec- 
tive roles and duties. Two issues must be addressed: 

1. Signatory parties do not want to share the 
whole SLA with their supporting parties; instead, 
they restrict what they share to the information 
each supporting party needs to configure its 
components. Signatory parties must therefore analyze 
the SLA and extract the relevant information for each party. 
In the case of a measurement service, this is 
primarily the definition of SLA parameters and 
metrics. SLO evaluation services get the SLOs 
they need to verify. All parties need to know 
the definitions of the interfaces they must 
expose, as well as the interfaces of the partners 
they interact with. 

2. Components of different parties cannot be 
assumed to be configurable in the same way, 
i.e., they may have heterogeneous configuration 
interfaces. 


Thus, the deployment process contains two steps. 
In the first step, the SLA deployment system of a signa- 
tory party generates and sends configuration information 
in the Service Deployment Information (SDI) format 
(omitted for the sake of brevity), a subset of the lan- 
guage described later, to its supporting parties. In the 
second step, deployment systems of supporting parties 
configure their own implementations in a suitable way. 
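
The first deployment step, extracting for each supporting party only the part of the SLA it needs, can be sketched as a simple filter. The SDI format is omitted in the paper, so the dictionary layout, the `NEEDS` table, the function `extract_for`, and the example URL below are all assumptions made for illustration.

```python
# Which top-level SLA sections each supporting role needs. This mapping is an
# assumption drawn from the prose above, not the actual SDI format.
NEEDS = {
    "measurement": {"parameters", "metrics", "interfaces"},
    "condition_evaluation": {"objectives", "interfaces"},
}


def extract_for(role: str, sla: dict) -> dict:
    """Return only the sections of the SLA that the given role must know."""
    return {k: v for k, v in sla.items() if k in NEEDS[role]}


sla = {
    "parties": ["ACMEProvider", "XInc"],
    "pricing": {"monthly_fee": 1000},
    "metrics": ["uptime_s"],
    "parameters": ["availability"],
    "objectives": ["availability >= 0.99"],
    "interfaces": {"measurement": "http://example.invalid/measure"},
}
print(sorted(extract_for("measurement", sla)))
print(sorted(extract_for("condition_evaluation", sla)))
```

In this sketch neither supporting role ever sees the pricing section or the list of parties, which is the essence of the "need to know" principle.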


Phase 3: Measurement and Reporting 


This phase deals with configuring the runtime 
system in order to meet one or more SLOs, and with 
carrying out the computation of SLA parameters by 
retrieving resource metrics from the managed 
resources and executing the management functions 
(solid arrows in Figure 2). The following services 
implement the functionality needed during this phase: 


Measurement Service: The Measurement Ser- 
vice maintains information on the current system con- 
figuration, and run-time information on the metrics that 
are part of the SLA. It measures SLA parameters such 
as availability or response time either from inside, by 
retrieving resource metrics directly from managed 
resources, or outside the service provider’s domain, 
e.g., by probing or intercepting client invocations. A 
Measurement Service may measure all or a subset of 
the SLA parameters. Multiple measurement services 
may simultaneously measure the same metrics. 


Condition Evaluation Service: This service is 
responsible for comparing measured SLA parameters 
against the thresholds defined in the SLA and notify- 
ing the management system. It obtains measured val- 
ues of SLA parameters from the Measurement Service 
and tests them against the guarantees given in the 
SLA. This can be done each time a new value is avail- 
able, or periodically. 
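
The two evaluation modes mentioned above (on each new value, or periodically) can be sketched with a small class. The class `ConditionEvaluator`, its method names, and the 2-second guarantee are hypothetical; the actual service is implemented as a Web Service, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Optional


class ConditionEvaluator:
    """Tests measured values against a guarantee and reports violations."""

    def __init__(self, guarantee: Callable[[float], bool],
                 notify: Callable[[str, float], None]):
        self.guarantee = guarantee
        self.notify = notify
        self.latest: Optional[float] = None

    def on_new_value(self, name: str, value: float) -> None:
        # Event-driven mode: evaluate as soon as a value arrives.
        self.latest = value
        self._check(name, value)

    def on_timer(self, name: str) -> None:
        # Periodic mode: re-evaluate the most recent value, if any.
        if self.latest is not None:
            self._check(name, self.latest)

    def _check(self, name: str, value: float) -> None:
        if not self.guarantee(value):
            self.notify(name, value)


violations = []
evaluator = ConditionEvaluator(lambda v: v <= 2.0,
                               lambda n, v: violations.append((n, v)))
evaluator.on_new_value("response_time_s", 1.2)
evaluator.on_new_value("response_time_s", 2.7)
print(violations)
```

Passing the notification target in as a callback keeps the evaluator independent of whichever management system receives the violation.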


Phase 4: Corrective Management Actions 


Once the Condition Evaluation Service has deter- 
mined that an SLO has been violated, corrective man- 
agement actions need to be carried out. The function- 
ality that needs to be provided in this phase spans two 
different services: 


Management Service: Upon receipt of a notifi- 
cation, the management service (usually implemented 
as part of a traditional management platform) will 
retrieve the appropriate actions to correct the problem, 
as specified in the SLA. Before acting upon the man- 
aged system, it consults the business entity (see 
below) to verify if the proposed actions are allowable. 
After receiving approval, it applies the action(s) to the 
managed system. 
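
The approval step described above can be sketched as a function that consults the business entity before each action. The function `handle_violation`, the action strings, and the refusal policy are assumptions for illustration; as the paper notes, a real business entity decision is expected to involve a human.

```python
from typing import Callable


def handle_violation(actions: list[str],
                     approve: Callable[[str], bool],
                     apply: Callable[[str], None]) -> list[str]:
    """Apply each SLA-specified action only if the business entity approves it."""
    applied = []
    for action in actions:
        if approve(action):      # consult the business entity first
            apply(action)
            applied.append(action)
    return applied


# Hypothetical policy: the business entity refuses to add servers.
log = []
done = handle_violation(
    ["open trouble ticket", "add server"],
    approve=lambda a: a != "add server",
    apply=log.append,
)
print(done)
```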


It should be noted that the management compo- 
nent seeks approval for every proposed action from 
the business entity. The main purpose of the manage- 
ment service is to execute corrective actions on behalf 
of the managed environment if a Condition Evaluation 
Service discovers that a term of an SLA has been vio- 
lated. While such corrective actions are limited today 
to opening a trouble ticket or sending an event to the 
provider’s management system, we envision this com- 
ponent playing a crucial role in the future by acting as 
an automated mediator between the customer and 
provider, according to the terms of the SLA. This 
includes the submission of proposals to the manage- 
ment system of a service provider on how a perfor- 
mance problem could be resolved (e.g., proposing to 
assign a different traffic category to a customer if sev- 
eral categories have been defined in the SLA). 


Our implementation addresses very simple 
corrective actions; finding a generic, flexible and 
automatically executable mechanism for corrective 
management actions remains an open issue. 


Business Entity: It embodies the business 
knowledge, goals and policies of a signatory party (here: 
the service provider), which are usually kept confidential. 
Such knowledge enables the business entity to verify 
whether the actions specified in the SLA (possibly some 
time ago) are still compatible with the current business 
targets. If this is the case, the business entity will send 
a positive acknowledgement in response to the request of 
the Management Service; if the proposed actions 
conflict with the current goals of the service 
provider, its business entity will decline the request 
and the management service will refrain from carrying 
them out. It should be noted that declining previously 
agreed-upon actions may be regarded by another party 
as a breach of the SLA, entailing, in a severe case, 
termination of the business relationship. Since it is 
unlikely that decisions of this importance will be left 
to the discretion of an automated system, we assume 
that the decision of the business entity requires human 
intervention. While we have implemented the afore- 
mentioned services, we have postponed an implemen- 
tation of a business entity component until appropriate 
mechanisms for specifying and enforcing business 
policies are available. 


Our experience shows that the tasks covered by 
these two services become extremely complicated as 
soon as sophisticated management actions need to be 
specified: First, a service provider would need to 
expose what management operations he is able to exe- 
cute, which is very specific to the management plat- 
forms (products, architectures, protocols) he uses. Sec- 
ond, these management actions may become very com- 
plicated and may require human interaction (such as 
deploying new servers). Finally, due to the fact that the 
provider’s managed resources are shared among vari- 
ous customers, management actions that satisfy an SLA 
with one customer are likely to impact the SLAs the 
provider has with other customers. The decision 
whether to satisfy the SLA (or deliberately break it) 
therefore is not a technical decision anymore, but rather 
a matter of the provider’s business policies and, thus, 
lies beyond the scope of the work discussed in this 
paper. Consequently, only a few elements of the WSLA 
language address this phase of the service lifecycle. 


Phase 5: SLA Termination 


The SLA may specify the conditions under 
which it may be terminated or the penalties a party 
will incur by breaking one or more SLA clauses. 
Negotiations for terminating an SLA may be carried 
out between the parties in the same way as for SLA 
establishment. Alternatively, an expiration 
date for the SLA may be specified in the SLA. 


SLA Compliance Monitor Implementation 


Figure 2 shows which WSLA services have been 
implemented. Because of their major importance and 
their excellent suitability for automated processing, 
we have implemented the Deployment, Measurement 
and Condition Evaluation services. These 
services are implemented as Web Services themselves 
and are jointly referred to as the SLA Compliance 
Monitor, which acts as a wrapper for the three services. For 
information on where to download the implementation, 
the reader is referred to the ‘Availability’ section. Our 
ongoing implementation efforts, aimed at completing 
the WSLA framework, are described in ‘Conclusions 
and Outlook.’ 


Signatory and Supporting Parties 


Figure 3 gives an overview of a configuration 
where two signatory parties and two supporting parties 
collaborate in the monitoring of an SLA. 


In bilateral SLAs, it is usually straightforward to
define for each commitment which party is obliged and
which is the beneficiary. However, in an SLA involving
more than two parties, it is not obvious which party
guarantees what to whom, so a clear definition of
responsibilities is required. The WSLA environment
involves multiple parties to enact an SLA instance. As
mentioned above, a part of the monitoring and
supervision activities can be assigned to parties
other than the service provider and customer.


Figure 3: Signatory and supporting parties.


We approach the issue of responsibility by defining
two classes of parties. The service provider
(ACMEProvider in Figure 3) and the service customer
(XInc) are the signatory parties to the SLA. They are
ultimately responsible for all obligations (mainly in
the case of the service provider) and are the ultimate
beneficiaries of obligations. Supporting parties are
sponsored by one or both of the signatory parties to
perform one or more of a particular set of roles: A
measurement service (YMeasurement) implements a part
or all of the measurement and computation activities
defined within an SLA. A condition evaluation service
(ZAuditing) implements violation detection and other
state checking functionality that covers all or a part
of the guarantees of an SLA. A management service
implements corrective actions.


There can be multiple supporting parties having
a similar role, e.g., a measurement service may be
located in the provider’s domain while another
measurement service probes the service offered by the
provider across the Internet from various locations.
Keynote Systems, Inc. [15] is an example of such an 
external measurement service provider. 


Despite the fact that a multitude of parties may 
be involved in providing a service, these interactions 
may be broken down into chained customer/provider 
relationships. Every interaction therefore involves 
only two roles, a sender and a recipient. During our 
work, we have not encountered a need for multi-party 
SLAs, i.e., SLAs that are simultaneously negotiated 
and signed by more than two parties. Multi-party con- 
tracts do not seem to provide enough value to justify 
their added complexity. 


The WSLA Language 


The WSLA language, specified in [19], defines a
type system for the various SLA artifacts and is based
on XML Schema [34, 35]. We give an overview of the
general structure of an SLA and motivate the various
constructs of the WSLA language, which are described
by means of examples in the subsequent sections. We
first describe the information that a Measurement
Service needs to process; then, we focus on the parts
of the language that a Condition Evaluation Service
needs to understand in order to evaluate whether a
service level objective has been violated.


WSLA in a Nutshell 


Figure 4 illustrates the typical elements of an SLA
with signatory and supporting parties. Clearly, there 
are many variations of what types of information and 
which rules are to be included and, hence, enforced in 
a specific SLA. 




The Parties section, consisting of the signatory
parties and supporting parties fields, identifies all
the contractual parties. Signatory Party descriptions
contain the identification and the technical
properties of a party, i.e., its interface definition
and its addresses. The definitions of the Supporting
Parties contain, in addition to the information
contained in the signatory party descriptions, an
attribute indicating the sponsor(s) of the party.

The Service Description section of the SLA 
specifies the characteristics of the service and its 
observable parameters as follows: 

* For every Service Operation, one or more Bindings,
  i.e., the transport encoding for the messages to be
  exchanged, may be specified. Examples of such
  bindings are SOAP (Simple Object Access Protocol),
  MIME (Multipurpose Internet Mail Extensions) or HTTP
  (HyperText Transfer Protocol).

* In addition, one or more SLA Parameters of the
  service may be specified. Examples of such SLA
  parameters are service availability, throughput, or
  response time.

* As mentioned earlier, every SLA parameter refers to
  one (composite) Metric, which, in turn, aggregates
  one or more other (composite or resource) metrics,
  according to a measurement directive or a function.
  Examples of composite metrics are maximum response
  time of a service, average availability of a service,
  or minimum throughput of a service. Examples of
  resource metrics are: system uptime, service outage
  period, number of service invocations.

    * Measurement Directives specify how an individual
      metric can be accessed. Typical examples of
      measurement directives are the uniform resource
      identifier of a hosted computer program, a
      protocol message, or the command for invoking
      scripts or compiled programs.

    * Functions are the measurement algorithm, or
      formula, that specifies how a composite metric is
      computed. Examples of functions are formulas of
      arbitrary length containing average, sum,
      minimum, maximum, and various other arithmetic
      operators, or time series constructors.

    * For every function, an Evaluation Period is
      specified. It defines the time intervals during
      which the functions are executed to compute the
      metrics. These time intervals are specified by
      means of start time, duration, and frequency.
      Examples of the latter are weekly, daily, hourly,
      or every minute.


Figure 4: General structure of an SLA.
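The Evaluation Period described above (start time, duration, and frequency) can be illustrated with a short Python sketch. The function name `evaluation_times` and the choice to exclude the end of the interval are assumptions made for illustration only; the WSLA specification may define the interval boundaries differently.

```python
from datetime import datetime, timedelta
from typing import List


def evaluation_times(start: datetime, duration: timedelta,
                     frequency: timedelta) -> List[datetime]:
    """List the instants at which a function would be executed.

    Assumption: the period covers [start, start + duration) and the
    function runs once at the start of every frequency step.
    """
    times = []
    current = start
    end = start + duration
    while current < end:
        times.append(current)
        current += frequency
    return times


# An hourly evaluation over one day (hypothetical values).
hourly = evaluation_times(datetime(2001, 12, 1),
                          timedelta(days=1),
                          timedelta(hours=1))
```

With these sample values the function is scheduled 24 times, starting at midnight and ending at 23:00 on the same day.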


Obligations, the last section of an SLA, define 
various guarantees and constraints that may be 
imposed on the SLA parameters: 

* First, the Validity Period is specified; it
  indicates the time intervals for which a given SLA
  parameter is valid, i.e., when the SLO may be
  applied. Examples of validity periods are business
  days, regular working hours or maintenance periods.

* The Predicate specifies the threshold and the
  comparison operator (greater than, equal, less than,
  etc.) against which a computed SLA parameter is to be
  compared. The result of the predicate is either true
  or false.




* Actions, finally, are triggered whenever a
  predicate evaluates to true, i.e., a violation of an
  SLO has occurred. Examples of actions are sending an
  event to one or more signatory and supporting
  parties, opening a trouble ticket or problem report,
  payment of a penalty, or payment of a premium. Note
  that, as the last example indicates, a service
  provider may very well receive additional
  compensation from a customer for exceeding an
  obligation, i.e., obligations reflect constraints
  that may trigger the payment of credits from any
  signatory party to another signatory or supporting
  party. Also note that zero or more actions may be
  specified for every SLA parameter.


Service Description: Associating SLA Parameters 
with a Service 


The purpose of the service description is the
clarification of three issues: To which service do SLA
parameters relate? What are the SLA parameters?
How are the SLA parameters measured or computed?
This is the information a Measurement Service 
requires to carry out its tasks. A sample service 
description is depicted in Figure 5. 


Service Objects and Operations 


The service object, depicted at the top of Figure
5, provides an abstraction for all conceptual elements
for which SLA parameters and the corresponding
metrics can be defined. In the context of Web Services,
the most detailed concept whose quality aspect can be
described separately is the individual operation (in a
binding) described in a WSDL specification. In our
example, the operation getQuote is the service object.
In addition, quality properties of groups of WSDL
operations can be defined, the operation group being
the service object in this case. Outside the scope of
Web Services, business processes, or parts thereof, can
be service objects (e.g., defined in WSFL [18]).
Service objects have a set of SLA parameters, a set of
metrics that describe how SLA parameters are computed
or measured and a reference to the service itself
that is the subject of the service object abstraction.


Figure 5: Sample elements of a service description.


While the format for SLA parameters and met- 
rics is the same for all services (though not their indi- 
vidual content), the reference to the service depends 
on the particular way in which the service is 
described. For example, service objects may contain
references to operations in a WSDL file.


SLA Parameters and Metrics 


SLA parameters are properties of a service 
object; each SLA parameter has a name, type and unit. 
SLA parameters are computed from metrics, which 
either define how a value is to be computed from other 
metrics or describe how it is measured. For this pur- 
pose, a metric either defines a function that can use 
other metrics as operands or it has a measurement 
directive that describes how the metric’s value should 
be measured. Since SLA parameters are the entities 
that are surfaced by a Measurement Service to a Con- 
dition Evaluation Service, it is important to define 
which party is supposed to provide the value (Source) 
and which parties can receive it, either event-driven 
(Push) or through polling (Pull). Note that one of our 
design choices is that SLA parameters are always the 
result of a computation, i.e., no SLA parameters can 
be defined as input parameters for computing other 
SLA parameters. In Figure 5, one metric is retrieved
by probing an interface (Service Probe) while the other
ones (TXcount, Timecount) are directly retrieved from
the service provider’s management system.


<SLAParameter name="UpTimeRatio"
              type="float" unit="downEvents/hour">
  <Metric>UpTimeRatioMetric</Metric>
  <Communication>
    <Source>ACMEProvider</Source>
    <Pull>ZAuditing</Pull>
    <Push>ZAuditing</Push>
  </Communication>
</SLAParameter>


Figure 6: Defining an SLA Parameter UpTimeRatio. 


Figure 6 depicts how an SLA parameter UpTimeRatio
is defined. It is assigned the metric
UpTimeRatioMetric, which is defined independently of
the SLA parameter so that it can be used multiple
times. ACMEProvider promises to send (push) new values
to ZAuditing, which is also allowed to retrieve new
values on its own initiative (pull). The purpose of a
metric is to define how to measure or compute a value.
Besides a name, a type and a unit, it contains either a
function or a measurement directive and a definition
of the party that is in charge of computing this value.


Figure 7 shows an example composite metric
containing a function. UpTimeRatioMetric is of type
double and has no unit. YMeasurement is in charge of
computing this value. The example illustrates the
concept of a function: The number of occurrences of
“0” in the time series of the metric StatusTimeSeries,
assuming that a “0” represents a down event in a time
series of probes taken once per minute, is divided by
1440 (the number of minutes in a day) to yield the
downtime ratio. This value is subtracted from 1 to
obtain the UpTimeRatio. Specific functions, such as
Minus, Plus or ValueOccurs, are extensions of the
common function type.


<Metric name="UpTimeRatioMetric" type="double" unit="">
  <Source>YMeasurement</Source>
  <Function xsi:type="Minus" resultType="double">
    <Operand>
      <LongScalar>1</LongScalar>
    </Operand>
    <Operand>
      <Function xsi:type="Divide" resultType="long">
        <Operand>
          <Function xsi:type="ValueOccurs" resultType="long">
            <Metric>StatusTimeSeries</Metric>
            <Value>
              <LongScalar>0</LongScalar>
            </Value>
          </Function>
        </Operand>
        <Operand>
          <LongScalar>1440</LongScalar>
        </Operand>
      </Function>
    </Operand>
  </Function>
</Metric>


Figure 7: Defining a Metric UpTimeRatioMetric.


Operands of functions can be metrics, scalars and
other functions. It is expected that a measurement ser- 
vice, provided either by a signatory or a supporting 
party, is able to compute functions. Specific functions 
can be added to the language as needed. 
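

To make the function concept more concrete, the following Python sketch evaluates the function tree of Figure 7 over a sample time series. The tuple encoding of the tree and the names `evaluate` and `UPTIME_RATIO_METRIC` are assumptions introduced for illustration; they are not part of the WSLA implementation. The sketch also uses floating-point division, whereas Figure 7 declares the result of Divide as long.

```python
from typing import Dict, List, Union

# Hypothetical in-memory encoding of a WSLA function tree:
#   ("Minus", a, b), ("Plus", a, b), ("Divide", a, b),
#   ("ValueOccurs", metric_name, value); plain numbers are scalars.
Node = Union[int, float, tuple]


def evaluate(node: Node, metrics: Dict[str, List[int]]) -> float:
    """Recursively evaluate a function tree against measured time series."""
    if isinstance(node, (int, float)):
        return float(node)
    op = node[0]
    if op == "ValueOccurs":
        _, metric_name, value = node
        return float(metrics[metric_name].count(value))
    left = evaluate(node[1], metrics)
    right = evaluate(node[2], metrics)
    if op == "Minus":
        return left - right
    if op == "Plus":
        return left + right
    if op == "Divide":
        return left / right
    raise ValueError(f"unsupported function type: {op}")


# The tree of Figure 7: 1 - (occurrences of 0 in StatusTimeSeries / 1440).
UPTIME_RATIO_METRIC = (
    "Minus", 1, ("Divide", ("ValueOccurs", "StatusTimeSeries", 0), 1440)
)

# One probe per minute for a day, 144 of which failed (value 0).
series = [0] * 144 + [1] * 1296
uptime = evaluate(UPTIME_RATIO_METRIC, {"StatusTimeSeries": series})
```

With 144 failed probes out of 1440, the downtime ratio is 0.1 and the resulting uptime ratio is approximately 0.9.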


A Measurement Directive, depicted in Figure 8,
specifies how the metric is retrieved from the source
(either by means of a well-defined query interface
offered by the Service Provider, or directly from the
instrumentation of a managed resource by means of a
management protocol operation). The example uses a
specific type of measurement directive, StatusRequest.
It contains a URL that is used for probing whether the
getQuote operation is available. Clearly, other ways
to measure values require an entirely different set of
information items, e.g., an SNMP port, an object
identifier (OID) and an instance identifier to
retrieve a counter.
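

A probe of the StatusRequest kind might look like the following Python sketch. It is an assumption that the directive yields 1 when the operation is reachable and 0 otherwise (consistent with the “0” down events counted in Figure 7); the names `http_fetch` and `status_request` are hypothetical, and the fetch function is injectable so that the probe can be exercised without network access.

```python
from typing import Callable
from urllib.request import urlopen


def http_fetch(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def status_request(url: str, fetch: Callable[[str], int] = http_fetch) -> int:
    """Probe a URL; 1 means available, 0 means a down event (assumed encoding)."""
    try:
        return 1 if 200 <= fetch(url) < 300 else 0
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError,
        # so refused connections and HTTP error responses count as down.
        return 0
```

A measurement service could call `status_request` once per minute and append the result to the StatusTimeSeries metric.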


Obligations: SLOs and Action Guarantees 


Based on the common ontology established in 
the service definition part of the SLA, the parties can 
unambiguously define the respective guarantees that 
they give each other. The WSLA language provides 
two types of obligations: 

* Service level objectives represent promises with
  respect to the state of SLA parameters.

* Action guarantees are promises of a signatory party
  to perform an action. This may include notifications
  of service level objective violations or invocation
  of management operations.


Important for both types of obligations is the def- 
inition of the obliged party and the definition of when 
the obligations need to be evaluated. Both have a simi- 
lar syntactical structure (as previously depicted in Fig- 
ure 4). However, their semantics are different. The 
content of an obligation is refined in a service level 
objective or an action guarantee. 


Service Level Objectives


A service level objective expresses a commitment
to maintain a particular state of the service in a given
period. Any party can take the obliged part of this
guarantee; however, this is typically the service
provider.


A service level objective has the following ele- 
ments: Obliged is the name of a party that is in charge of 
delivering what is promised in this guarantee. One or 
many ValidityPeriods define when the SLO is applicable. 


A logic Expression defines the actual content of
the guarantee, i.e., what is asserted by the service
provider to the service customer. Expressions follow
first order logic and contain the usual operators and,
or, not, etc., which connect either predicates or, again,
expressions. Predicates can have SLA parameters and
scalar values as parameters. By extending an abstract
predicate type, new domain-specific predicates can be
introduced as needed. Similarly, expressions could be
extended, e.g., to contain variables and quantifiers.
This provides the expressiveness to define complex
states of the service.


A service level objective may also have an
EvaluationEvent, which defines when the expression of
the service level objective should be evaluated. The
most common evaluation event is NewValue, which occurs
each time a new value for an SLA parameter used in a
predicate is available. Alternatively, the expression
may be evaluated according to a Schedule. A schedule is
a sequence of regularly occurring events. It can be
defined within a guarantee or may refer to a commonly
used schedule.


The example in Figure 9 illustrates a service level 
objective given by ACMEProvider and valid for a full 
month in the year 2001. It guarantees that the SLA 
parameter ThroughPutRatio must be greater than 1000 if 
the SLA parameter UpTimeRatio is less than 0.9, i.e., the 
ThroughPutRatio must be above 1000 transactions per 
minute even if the overall availability is below 90%. 
This condition should be evaluated each time a new
value for the SLA parameter is available. Note that we
deliberately chose to specify validity periods with
respect to a single SLA parameter, so that they apply
only indirectly to the overall SLA. Validity periods
for the overall SLA (possibly in addition to the
validity periods for each SLA parameter) would also be
possible, but we found this granularity too coarse.
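

The logic of the expression in Figure 9 can be sketched in a few lines of Python. The helper names `less`, `greater`, `implies` and `is_violated` are hypothetical and only mirror the predicate and operator names used in the figure; the sketch also assumes that the objective counts as violated exactly when its expression evaluates to false.

```python
from typing import Callable, Dict

Params = Dict[str, float]          # current values of SLA parameters
Expr = Callable[[Params], bool]    # an expression or predicate


def less(name: str, threshold: float) -> Expr:
    """Predicate: the SLA parameter is below the threshold."""
    return lambda params: params[name] < threshold


def greater(name: str, threshold: float) -> Expr:
    """Predicate: the SLA parameter is above the threshold."""
    return lambda params: params[name] > threshold


def implies(premise: Expr, conclusion: Expr) -> Expr:
    """Logical implication: false only if premise holds and conclusion fails."""
    return lambda params: (not premise(params)) or conclusion(params)


# UpTimeRatio < 0.9 implies ThroughPutRatio > 1000 (cf. Figure 9).
slo = implies(less("UpTimeRatio", 0.9), greater("ThroughPutRatio", 1000))


def is_violated(expression: Expr, params: Params) -> bool:
    """Assumption: an SLO is violated when its expression is false."""
    return not expression(params)
```

For example, an uptime ratio of 0.8 together with a throughput of 900 violates the objective, whereas any throughput is acceptable as long as the uptime ratio stays at or above 0.9.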


Action Guarantees 


An action guarantee expresses a commitment to
perform a particular activity if a given precondition is
met. Any party can be the obliged party of this kind of
guarantee, including, in particular, the supporting
parties of the SLA.


An action guarantee comprises the following ele- 
ments and attributes: Obliged is the name of a party 
that must perform an action as defined in this guaran- 
tee. A logic Expression defines the precondition of the 
action. The format of this expression is the same as the 
format of expression in service level objectives. An 
important predicate for action guarantees is the Viola- 
tion predicate that determines whether another guaran- 
tee, in particular a service level objective, has been 
violated. An EvaluationEvent or an evaluation Schedule 
defines when the precondition is evaluated. 


<Metric name="ServiceProbe" type="integer" unit="">
  <Source>YMeasurement</Source>
  <MeasurementDirective xsi:type="StatusRequest" resultType="integer">
    <RequestURL>http://ymeasurement.com/StatusRequest/GetQuote</RequestURL>
  </MeasurementDirective>
</Metric>


Figure 8: Defining a Measurement Directive StatusRequest.


QualifiedAction contains a definition of the action
to be invoked at a particular party. The concept of a
qualified action definition is similar to the invocation 
of an object method in a programming language, 
replacing the object name with a party name. The 
party of the qualified action can be the obliged or 
another party. The action must be defined in the corre- 
sponding party specification. In addition, the specifi- 
cation of the action includes the marshalling of its 
parameters. One or more qualified actions can be part 
of an action guarantee. 


ExecutionModality is an additional means to control
the execution of the action. It defines whether the
action should be executed each time an evaluation of the
expression yields true. The purpose is, for example, to
reduce the execution of a notification action to a
necessary level if the associated expression is
evaluated very frequently. The execution modality can
be: always, on entering a condition, or on entering and
leaving a condition. The example depicted in Figure 10
illustrates an action guarantee.
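

The effect of the three execution modalities can be illustrated with the following Python sketch. Only the value Always appears in Figure 10; the identifiers `OnEntering` and `OnEnteringAndLeaving`, as well as the function name `fire_points`, are assumptions chosen for this illustration.

```python
from typing import Iterable, List


def fire_points(results: Iterable[bool], modality: str) -> List[int]:
    """Return the indices of the evaluations at which the action runs.

    results holds the successive truth values of the precondition.
    """
    fired = []
    previous = False  # assume the condition does not hold initially
    for index, current in enumerate(results):
        if modality == "Always":
            execute = current
        elif modality == "OnEntering":
            execute = current and not previous
        elif modality == "OnEnteringAndLeaving":
            execute = current != previous
        else:
            raise ValueError(f"unknown execution modality: {modality}")
        if execute:
            fired.append(index)
        previous = current
    return fired


evaluations = [False, True, True, True, False, False, True]
```

For this sequence, Always executes the action at evaluations 1, 2, 3 and 6, whereas the entering modality executes it only at 1 and 6, and the entering-and-leaving modality at 1, 4 and 6.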


In the example, ZAuditing is obliged to invoke the
notification action of the service customer XInc if a
violation of the service level objective
SLO_For_ThroughPut_and_UpTime (cf. Figure 9) occurs.
The precondition should be evaluated every time the
evaluation of the SLO Must_Send_Notification_Guarantee
returns a new value. The action has three parameters:
the type of notification, the guarantee that caused
it to be sent, and the SLA parameters relevant for
understanding the reason of the notification. The
notification should always be executed.


<ServiceLevelObjective name="SLO_For_ThroughPut_and_UpTime">
  <Obliged>ACMEProvider</Obliged>
  <Validity>
    <Start>2001-11-30T14:00:00.000-05:00</Start>
    <End>2001-12-31T14:00:00.000-05:00</End>
  </Validity>
  <Expression>
    <Implies>
      <Expression>
        <Predicate xsi:type="Less">
          <SLAParameter>UpTimeRatio</SLAParameter>
          <Value>0.9</Value>
        </Predicate>
      </Expression>
      <Expression>
        <Predicate xsi:type="Greater">
          <SLAParameter>ThroughPutRatio</SLAParameter>
          <Value>1000</Value>
        </Predicate>
      </Expression>
    </Implies>
  </Expression>
  <EvaluationEvent>NewValue</EvaluationEvent>
</ServiceLevelObjective>


Figure 9: Defining a Service Level Objective SLO_For_ThroughPut_and_UpTime.


<ActionGuarantee name="Must_Send_Notification_Guarantee">
  <Obliged>ZAuditing</Obliged>
  <Expression>
    <Predicate xsi:type="Violation">
      <ServiceLevelObjective>SLO_For_ThroughPut_and_UpTime</ServiceLevelObjective>
    </Predicate>
  </Expression>
  <EvaluationEvent>NewValue</EvaluationEvent>
  <QualifiedAction>
    <Party>XInc</Party>
    <Action actionName="notification" xsi:type="Notification">
      <NotificationType>Violation</NotificationType>
      <CausingGuarantee>Must_Send_Notification_Guarantee</CausingGuarantee>
      <SLAParameter>ThroughPutRatio UpTimeRatio</SLAParameter>
    </Action>
  </QualifiedAction>
  <ExecutionModality>Always</ExecutionModality>
</ActionGuarantee>


Figure 10: Defining an ActionGuarantee Must_Send_Notification_Guarantee.


Conclusions and Outlook 


This paper has introduced the novel WSLA 
framework for electronic services, in particular Web 
Services. The WSLA language allows a service
provider and its customer to define the quality of
service aspects of the service. The concept of supporting
parties allows signatory parties to include third parties 
into the process of measuring the SLA parameters and 
monitoring the obligations associated with them. In 
order to avoid the potential ambiguity of high-level 
SLA parameters, parties can define precisely how 
resource metrics are measured and how composite 
metrics are computed from others. The WSLA lan- 
guage is extensible and allows us to derive new 
domain-specific or technology-specific elements from 
existing language elements. The explicit representa- 
tion of service level objectives and action guarantees 
provides a very flexible mechanism to define obliga- 
tions on a case-by-case basis. Finally, the detachment 
from the service description itself makes the WSLA 
language and its associated services applicable to a 
wide range of electronic services. 


We have developed a prototype that implements
three WSLA services: first, a deployment service that
provides the measurement and condition evaluation
services with the SLA elements they need to know;
second, a measurement service that can interpret
measurement directives for the instrumentation of a
Web services gateway and can aggregate high-level
metrics using a rich set of functions for arithmetic
and time series; and third, a general-purpose condition
evaluation service that supports a wide range of
predicates. Currently, we provide extensions to the
WSLA language
that apply to quality aspects of business processes and 
a template format for advertising SLAs in service reg- 
istries such as UDDI. In addition, we are working on 
an SLA editing environment. The integration with 
existing resource management systems and architec- 
tures is currently underway, with a special focus on 
the Common Information Model (CIM). 


Availability 


The SLA Compliance Monitor is included in the
current version 3.2 of the IBM Web Services Toolkit
and can be downloaded from
http://www.alphaworks.ibm.com/tech/webservicestoolkit.


ABSTRACT


HotSwap is a program that provides transparent failover for existing UNIX servers without 
modification or special hardware. HotSwap runs two instances of a server on independent 
machines in sync, so that if either machine fails, the other may assume control without breaking 
TCP connections or losing application state. Replication and failover are transparent to both clients
and servers. Servers are not aware that a backup replica is maintaining state. Clients are unaware
that a backup server has taken over from a failed master. This system is applicable to a wide
variety of common servers, including Java, Apache, and PostgreSQL, and to other servers that may
have no other mechanisms for fault tolerance.


Introduction 


Internet server applications must scale to many
users, be constantly available, and provide reliable
service despite server failures, maintenance
interruptions, and disasters. These features are
critical in the long term, but are usually only
considered after initial development. Most server
applications today are developed with inexpensive
components that do not support reliability, high
availability, scalability, or any form of fault
tolerance.


Adding fault tolerance to an existing system can 
be difficult and expensive, and may not be possible for 
some kinds of server applications. Many Internet server 
applications are developed using freely available tools 
like Apache, PHP, MySQL, etc. None of these applica- 
tions has built-in fault tolerance. Current techniques for 
adding fault tolerance will allow clients to reconnect to 
a new server if one fails, but connections and state at 
the failed server will be lost. Fault-tolerant database and 
shared file servers are expensive. Implementing fault 
tolerance in a custom server is difficult. 


HotSwap is a program that adds transparent fault
tolerance to existing servers without modification.
Application state is duplicated on two independent
boxes that run in parallel. If one fails, the other can
continue without interrupting client connections.
Replication is transparent to both clients and servers.


There are several other techniques for fault
tolerance [Coulouris01]. Each makes tradeoffs between
degree of fault coverage, cost, and abilities. Scalability
is the ability to increase performance by adding system
components. Availability is the probability that a system
is functioning properly at any moment in time.
Reliability is the probability that a system will
continue to function for a fixed period of time.
Recoverability is the ability of a system to return to a
functioning state without data loss after a failure.
Total cost includes hardware, software, development,
training, and maintenance. Disaster recovery is the
ability to recover from a large-scale disaster like fire
or earthquake that can damage a wide area. A single
system image reduces maintenance costs when several
components can be maintained as a single logical unit.

Different capabilities compete with each other. 
Adding system components to increase scalability and 
availability will increase cost and may decrease relia- 
bility. Reliability and recoverability require synchro- 
nized backups, which may decrease performance and 
scalability. Disaster recovery requires backups sepa- 
rated by long distances at reduced bandwidth, slowing 
system synchronization and performance. 


Faults may come from software bugs, hardware 
failures, or disasters. Hardware failures should be 
masked by switching to a backup system. Disaster 
may strike a whole building or geographical area. 
Recovering data after a disaster is crucial, but restor- 
ing network connections may require new routing. 


Different servers have different requirements for 
fault tolerance. Web servers use short transactions, 
which can usually be repeated if they fail. Databases 
and file servers that update client data must be careful 
to maintain consistency with their backups. Telecon- 
ferencing and gaming servers maintain real-time state 
in memory. 


Let’s quickly review current techniques for
adding fault tolerance.¹


Periodic Backup 


Periodic backup is the easiest way to add disaster 
recovery. If a system crashes, a new one must be 
rebuilt from backup. Starting a new database server 
and restoring its backup may take some time. Changes 
to the system since the last backup are lost. Running
applications may have to be shut down to prevent file
updates during backup. Connections to the system will
be lost if it fails, and will have to be restarted when the
new system is restored.


¹ The Aberdeen group has published an excellent
comparison of current high-availability products for Linux at
http://www.legato.com/resources/whitepapers/Aberdeen%20White%20Paper%20-%20Linux | .pdf.


There are backup schemes for whole processes,
not just files. Process checkpointing and hijacking
[Skoglund00] is a technique for preserving and
replicating application state by serializing the entire
state at some point and restoring it later. A process
checkpoint consists of its data pages, executable path,
and system descriptors such as open files, file pointers,
etc. Condor² [Zandy99], [Litzkow97] uses application
checkpointing to move an application completely from
one machine to another. Like backing up data files,
checkpointing is not suitable for real-time
synchronization between two running processes.


Server Clusters 


A server cluster uses a front-end director to dis- 
tribute client requests over several back-end servers. All 
back-end servers provide a consistent application inter- 
face. Consistency requires that all back-end servers use 
a shared database or file system. If a back-end server 
fails, the director will send new client requests to 
another server in the pool. The new server will retrieve 
the client’s state from the shared database. 


The Linux Virtual Server³ project provides a
director and a framework for back-end servers to use a
shared Coda file system. F5 Networks’ BigIP⁴ acts as
a Director, monitoring RealServer health and
distributing requests. BigIP also works redundantly and
preserves client connections and encryption state on
failures. Cisco’s LocalDirector⁵ is a similar product.


Directors are usually redundant. If one fails, the 
other assumes its IP address to accept new connections. 
Directors monitor the health of the back-end servers and 
remove failed servers from the pool. Directors may ter- 
minate client connections and attempt to repeat client 
requests if a back-end server fails. Directors do not 
address reliability or recoverability. That’s up to the 
back-end servers and common database. 
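

The director behavior described above can be sketched as follows. This is a minimal Python illustration, assuming round-robin distribution and an external health check that calls `mark_failed`; the class and server names are hypothetical and do not describe any of the products mentioned.

```python
from typing import List


class Director:
    """Distributes new connections over a pool of healthy back-end servers."""

    def __init__(self, servers: List[str]) -> None:
        self.pool = list(servers)
        self._next = 0

    def mark_failed(self, server: str) -> None:
        """Remove a server that failed its health check from the pool."""
        if server in self.pool:
            self.pool.remove(server)

    def route(self) -> str:
        """Pick the back-end server for a new connection (round robin)."""
        if not self.pool:
            raise RuntimeError("no healthy back-end servers")
        server = self.pool[self._next % len(self.pool)]
        self._next += 1
        return server


director = Director(["rs1", "rs2", "rs3"])
first = [director.route() for _ in range(3)]
director.mark_failed("rs2")
after_failure = [director.route() for _ in range(4)]
```

After `rs2` is marked as failed, new connections go only to the remaining servers. Connections that were already open on `rs2` are not covered by this sketch, which mirrors the point that directors address availability rather than reliability.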


² http://www.cs.wisc.edu/condor/
³ http://www.linuxvirtualserver.org/
⁴ http://www.f5networks.com/
⁵ http://www.cisco.com/warp/public/cc/pd/cxsr/400/index.shtml


It is important to distinguish between availability
and reliability in a server cluster. Clusters increase
availability and scalability, but not reliability. If a 
server fails, another may be substituted for new con- 
nections, and thus availability is increased. However, 
once a client is connected to a back-end server, the 
connection’s reliability is determined by the reliability 
of the chain of the director, back-end-server, and com- 
mon database. The more components in the chain, the 
less reliable it is. Consider a server failure part way 
through a long file download. The replacement server 
will not know the state of the connection at the time of 
the failure, and the download must be restarted from 
the beginning. 
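
The difference can be made concrete with a small calculation. The Python sketch below assumes that components fail independently and uses made-up probabilities purely for illustration: the reliability of a connection is the product of the reliabilities of every component in its chain, while the availability of a pool for new connections only requires that at least one server is up.

```python
from math import prod
from typing import Iterable


def chain_reliability(component_reliabilities: Iterable[float]) -> float:
    """Reliability of a connection that needs every component in the chain."""
    return prod(component_reliabilities)


def pool_availability(server_availability: float, servers: int) -> float:
    """Probability that at least one of several identical servers is up."""
    return 1.0 - (1.0 - server_availability) ** servers


# Hypothetical figures: director 0.99, back-end server 0.95, database 0.99.
chain = chain_reliability([0.99, 0.95, 0.99])

# Three back-end servers, each available 95% of the time.
pool = pool_availability(0.95, 3)
```

With these numbers the chain is about 0.931 reliable, which is lower than any single component, while the pool is about 0.9999 available, which is higher than any single server.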

Some directors, such as NetScaler⁶, buffer the client
transaction requests and will repeat them if the server
fails. These are built to support short non-interactive 
transactions, like HTTP requests. Interactive or 
unbounded connections cannot be buffered and retried 
at the director level. 


Server clusters are ideal for web server applica- 
tions. There are many simultaneous client connec- 
tions, but they are each short and mostly read-only. 
The client’s session (e.g., a shopping cart) is the only 
information that changes frequently, and that is stored 
on a shared database. None of the servers are expected 
to maintain state themselves. Client connections will 
be broken if a Director or RealServer fails. A broken 
connection is a minor inconvenience unless it’s during 
a long download that must be repeated. 


It is up to the back-end servers to synchronize 
shared state. Server machines may all connect to a cen- 
tral database or file server. If a middle server fails, all 
its state is lost and client connections are broken, but 
new client connections are directed to a new server. If a 
shared database fails, the whole cluster fails. 

Fault Tolerant Applications 

Some commercial database and file servers have 
built-in support for system synchronization and fail- 
over. Strict application consistency can seriously 
degrade performance. Lazy replication improves performance
but relaxes consistency guarantees. Database
replication is still an active area of research [Patiño00,
Wiesmann00].

⁶ http://www.netscaler.com/product/technology.html.

Figure 1: Schematic diagram of server cluster.


Commercial databases like Oracle⁷ and Solid⁸
provide good application-level fault tolerance.
Fault-tolerant file servers include NetApp Filer⁹.


Server clusters need a fault-tolerant shared 
database for reliability. However, most fault-tolerant 
applications are expensive. Rewriting an application to 
use a new database is not trivial. Converting an exist- 
ing application usually requires considerable expense, 
training, and maintenance. 


Fault-Tolerant Programming Frameworks 


Developing a fault tolerant server is difficult. 
Another option is to write or rewrite a custom server 
from scratch using a development framework that supports
fault tolerance, like J2EE and Enterprise Java
Beans (EJB¹⁰).


These frameworks work well for the domain they 
were designed for. EJB is designed for writing Web- 
like applications where scripts provide an interface to 
a common database. EJB provides facilities for load 
balancing requests, maintaining client sessions, and 
fault tolerance. 


Close examination of the EJB specification, however,
reveals a limitation of EJB fault tolerance: transactions
must be idempotent. If a transaction fails, the
framework will automatically repeat it. Repeating a
transaction must be equivalent to doing it exactly
once. Fetching a record from a database is idempotent,
but decrementing a balance is not. Applications that
require non-idempotent operations have to implement 
their own fault tolerance. 
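
The difference is easy to see in a toy example. The sketch below
is not EJB code; the Account class and the blind retry helper are
hypothetical stand-ins for a framework that repeats a transaction
after a suspected failure.

```python
class Account:
    """Hypothetical record with a single balance field."""

    def __init__(self, balance):
        self.balance = balance


def fetch_balance(account):
    """Idempotent: repeating the read changes nothing."""
    return account.balance


def decrement_balance(account, amount):
    """Not idempotent: every repetition subtracts again."""
    account.balance -= amount
    return account.balance


def run_with_retry(operation, retries=1):
    """Run the operation, then blindly repeat it `retries` more times."""
    result = operation()
    for _ in range(retries):
        result = operation()
    return result


if __name__ == "__main__":
    acct = Account(100)
    print(run_with_retry(lambda: fetch_balance(acct)))          # read twice
    print(run_with_retry(lambda: decrement_balance(acct, 10)))  # charged twice
```

The read returns the same answer however often it is repeated,
while the repeated decrement charges the account twice.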


Fault-Tolerant Hardware 


Fault tolerant hardware has good performance, 
but it has been traditionally very expensive. Lately 
though, hosts with all redundant hardware components 
have become available at competitive prices. The Stratus
ftServer¹¹ is quite reasonably priced.


Other systems use shared hardware like a shared 
SCSI disk to preserve state when a master fails. The 
shared disk itself may be fault-tolerant. However, the 
server’s memory state and all client connections will 
be lost. 


These are excellent solutions for avoiding hard- 
ware failures. They are often expensive though. 
Backup servers are connected by cables, so they
cannot be geographically separated. When disaster
strikes, the backup and master will both be vulnerable.


The different levels of RAID illustrate the tradeoffs
between redundancy, performance, cost, and recovery. The
simplest, RAID 1, mirrors two disks. It is completely
redundant, and does not increase performance. Higher RAID
levels trade scalability for redundancy at increased cost.

⁷ www.oracle.com.
⁸ www.solidtech.com.
⁹ http://www.netapp.com/.
¹⁰ http://java.sun.com/products/ejb/.
¹¹ http://www.stratus.com/products/nt/.

Figure 2: Shared hardware with a single point of failure.

TCP Connection Migration


All the servers mentioned so far break client con- 
nections on failure because they do not attempt to pre- 
serve TCP state. TCP connections are managed by the 
operating system itself. Even if an application synchro- 
nizes its internal state with a backup, the backup will not 
be able to reconstruct the sequence numbers, window
sizes, and timeouts of the master’s TCP stack because
that information is in the kernel, not the application.


There is recent research on moving TCP state to 
another machine. However, this research does not 
address how a client or server should transfer applica- 
tion state as well as TCP state. These systems provide 
facilities for moving TCP connections from a failed 
host to another. However, the server on the new host is 
responsible for ensuring that its state is consistent with 
the failed server. The server must be explicitly written 
to synchronize application state with a backup server. 


[Snoeren01] describes extensions to the TCP 
protocol that allow a server or client to redirect a TCP 
connection to a new server without breaking the con- 
nection. This is handy for load balancing and avoiding 
failed servers. However, the TCP stack at both ends
must be written to recognize these new TCP packet 
options. Replacement servers must implement their 
own synchronization for application state. 


The reliable sockets system, Rocks [Zandy01], 
can preserve and reconnect TCP connections after link 
failures, address changes, and extended disconnection. 
Rocks does work without recompiling programs. 
However, servers must be rewritten to synchronize 
state. Rocks provides an API for managing and detect- 
ing resumed TCP connections. 


Some redundant server cluster directors can syn- 
chronize both TCP and SSL state with their backups. 
If one director fails, the backup can continue the TCP 
connections. This only applies to the directors, not the 
back-end servers. A back-end server’s TCP state is 
lost on failure. 


[Alvisi00] proposes a system that allows a server 
to keep its TCP connections open until it restarts after
failure. If the application is written to exploit this feature
and it can reconstruct its state after failure, it can
avoid closing client connections. However, the server 
must still be able to reconstruct its state after a crash. 


[Aghdaie01] presents a web server that preserves
TCP connections and server state over failures. This is 
an example of a web server specially written to trans- 
fer both TCP and application state to a backup. This is 
not a general solution for all servers. 


[Daniel99] presented a system similar to 
HotSwap. It used ptrace() to catch, redirect, and syn- 
chronize system calls between servers. However, that 
system used a central modified NFS server to provide 
a common file system. HotSwap does not use a shared 
file system; all servers are totally independent. 


HotSwap 


HotSwap maximizes availability and reliability 
by providing a hot backup server that maintains a 
complete independent copy of a master server’s state. 
The backup is complete and dynamic, so it can take 
over all client connections when a server fails without 
interruption. It minimizes cost by adding this ability to 
almost any server without modification or special 
hardware. Both master and backup appear as a single 
computer on the network, and thus HotSwap provides 
a single system image. The tradeoff is a small amount 
of overhead to keep the master and backup servers in 
sync. HotSwap does not address scalability; backup 
servers do not share the load. 


How it Works 


HotSwap starts two identical instances of the 
same set of programs on two independent machines, a 
master and a backup. The programs are started from 
the same initial state, with duplicate file systems. As 
they run, HotSwap ensures that both copies are syn- 
chronized. 


Synchronization means that both the master and 
backup programs see exactly the same input and pro- 
duce exactly the same output. When a client connects, 
both servers receive the new connection. When a 
client sends data, both servers receive it. When the 
master server makes a system call, like requesting the 
current time, HotSwap ensures the backup gets the 
same value. In this way, both servers will go through 
the same sequence of state transitions and produce the 
same output. The master sends its output to the client. 
The backup verifies it would produce the same output 
as the master, and then discards its output. 


If the master fails to produce output, or if it 
detects an internal error, the backup takes over. The 
backup can take over immediately since it is already in 
the same state as the master was before the master 
failed. The backup simply stops discarding its output. 
This is how HotSwap achieves transparent failover: 
the backup produces exactly the same output as the 
master would at any moment but discards its output 
until the master fails. 
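
The idea can be sketched in a few lines. This is a toy model, not
HotSwap's implementation: the classes, the single time() call, and
the request strings are all hypothetical. The master performs a
nondeterministic call for real and logs the result; the backup
replays the logged value, so both produce identical output.

```python
import itertools


class MasterCalls:
    """Performs nondeterministic calls for real and logs each result."""

    def __init__(self, clock):
        self._clock = clock
        self.log = []

    def time(self):
        value = self._clock()
        self.log.append(value)
        return value


class BackupCalls:
    """Replays the master's logged results instead of asking the OS."""

    def __init__(self, log):
        self._values = iter(log)

    def time(self):
        return next(self._values)


def handle_request(request, calls):
    """Toy deterministic server: output depends only on input and calls."""
    return f"{request} served at {calls.time()}"


def replicate(requests, clock):
    """Feed the same requests to master and backup; return both outputs."""
    master_calls = MasterCalls(clock)
    master_out = [handle_request(r, master_calls) for r in requests]
    backup_calls = BackupCalls(master_calls.log)
    backup_out = [handle_request(r, backup_calls) for r in requests]
    return master_out, backup_out


if __name__ == "__main__":
    fake_clock = itertools.count(1000).__next__  # stand-in for a real clock
    master_out, backup_out = replicate(["GET /a", "GET /b"], fake_clock)
    print(master_out == backup_out)
```

Because the backup's output matches the master's at every step,
taking over amounts to no longer discarding that output.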


Both servers must start in the same initial state. 
To start the system, the two HotSwap processes syn- 
chronize their file systems then execute their server 
programs. After a failure, a new backup server can 
synchronize files without interrupting the surviving 
master. The operator can later choose when to restart 
the master and new backup to achieve full fault toler- 
ance again. 


Details of Synchronizing State 


Synchronizing server state is critical. HotSwap 
requires that a server will produce the same output if it 
receives the same input from a particular set of 
sources. HotSwap synchronizes system calls like 
time(), getpid(), socket(), recv(), send(), etc. These are 
all the inputs used by our pilot servers, and they are 
enough to ensure that the servers we have tested are
synchronized. Some servers may use other sources of
input like inode numbers or direct access to hardware
that HotSwap cannot catch or synchronize. HotSwap
may not be able to synchronize these servers, but it is
likely that the servers themselves could be modified
slightly to work with HotSwap.

Figure 3: How HotSwap works.


HotSwap synchronizes system calls by relinking 
programs before they run. In UNIX and Windows, 
applications are dynamically linked against libraries 
that provide system calls. HotSwap inserts a shim 
library that redefines system calls to transfer control to 
the running HotSwap program. HotSwap also has to 
synchronize network traffic at the TCP/IP level. 
Clients and servers usually communicate over TCP, a 
network protocol that breaks a stream of data into 
packets that are reassembled in sequence and 
acknowledged. The operating system (OS) is responsi- 
ble for managing TCP connections. When a program 
calls send(), the OS adds that to the outgoing TCP 
buffer, sends a TCP packet and changes the TCP con- 
nection state. A TCP connection has many state vari- 
ables, including packet sequence numbers, timeouts, 
buffered packets, and acknowledgements. HotSwap 
must synchronize these state variables, so it uses its 
own TCP/IP network stack. 
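
A loose analogy in Python may help picture the shim. HotSwap
relinks programs against a shim library; the sketch below instead
swaps a function in a Python module at run time, which is only an
illustration of interposition and not the mechanism HotSwap uses.
All names in it are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def interposed(module, name, make_shim):
    """Temporarily replace module.name with a shim wrapping the original."""
    original = getattr(module, name)
    setattr(module, name, make_shim(original))
    try:
        yield
    finally:
        setattr(module, name, original)


def recording_shim(log):
    """Build a shim factory that forwards each call and logs its result."""
    def make_shim(original):
        def shim(*args, **kwargs):
            result = original(*args, **kwargs)
            log.append(result)
            return result
        return shim
    return make_shim


def unmodified_server():
    """Stands in for a server that knows nothing about the shim."""
    return f"now={time.time()}"


if __name__ == "__main__":
    log = []
    with interposed(time, "time", recording_shim(log)):
        print(unmodified_server())
    print(log)
```

The server code is untouched, yet every value it obtains from the
intercepted call is available to be forwarded to a backup.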


HotSwap constantly intercepts and synchronizes 
a subset of system calls, just enough to ensure
synchronization between processes. It also runs in a
file-system-only mode where it just synchronizes changes
to the local file system. This allows a backup to maintain
an active backup of the master’s file system without
incurring the extra overhead of full process
synchronization. This mode is used for disaster recovery
with a geographically remote backup.


Limitations 


HotSwap relies on each server receiving input 
only from the set of system calls that HotSwap moni- 
tors and synchronizes. HotSwap synchronizes all sys- 
tem calls to sockets, time, file stats, process ids, 
semaphores, and the /dev/random device. HotSwap 
cannot synchronize state if a server receives input 
from hardware devices or from the timing of asyn- 
chronous signals. 


The master and backup systems must start at the 
same time with identical file systems to ensure they 
receive the same input from local files. HotSwap runs 
chroot()’ed in a server directory to minimize the 
number of files that must be synchronized. The chroot()
has the extra advantage of improving security by lim- 
iting the server’s access to the file system. 


After a server fails, the survivor will continue 
running by itself as a solo master. A new backup sys- 
tem will continue to synchronize files with the running 
master, but will not synchronize applications until the 
running master restarts. If the running master fails, the 
backup will restart with a synchronized file system, 
but existing connections and transactions will be lost. 


Results 


HotSwap has been tested with Perl, Java, Python, 
Apache, OpenSSL, OpenSSH, and PostgreSQL under 
Linux. 


The first test was replicating a simple web server
written in Perl to serve video files. The master was
disconnected in the middle of a video, and the client
continued displaying the video from the backup without
interruption. The same test was then performed using
Java, Python, and Apache with good results. 


The OpenSSL s_server program was tested to
evaluate an encrypted web server. Encrypted connec- 
tions are usually difficult to replicate, since each side 
must maintain identical encryption state, and that
changes with every byte processed on an encrypted 
stream. However, as long as the master and backup use 
the same seed values to generate their initial session 
keys, they should maintain the same state. Unfortu- 
nately, our initial test with OpenSSL failed! Close 
examination of the OpenSSL source revealed that 
OpenSSL used an uninitialized buffer plus bytes from 
/dev/urandom for a seed. This highlights a limitation of 
HotSwap; processes must use only input from synchro- 
nized system calls. Fortunately, we easily patched the 
OpenSSL server to initialize its buffer and use only 
/dev/urandom, and the test succeeded. We tested
s_server to measure the overhead for just intercepting 
and monitoring system calls and full replication. We 
measured the time required to download various sizes of 
files to see how the overall bandwidth of the server was 
impacted by replication, using commodity hardware. 
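
The seeding problem can be illustrated with a toy key derivation.
The sketch below is not OpenSSL's scheme; it simply hashes the seed
bytes, and the helper names are hypothetical. Identical seed
material yields identical keys, while any unsynchronized bytes
mixed into the seed make master and backup diverge.

```python
import hashlib


def derive_session_key(seed_material):
    """Toy key derivation: a hash of the seed bytes (not OpenSSL's scheme)."""
    return hashlib.sha256(seed_material).hexdigest()


def seed_from(synchronized_random, unsynchronized_buffer=b""):
    """Combine random bytes with whatever else the program mixes in."""
    return unsynchronized_buffer + synchronized_random


if __name__ == "__main__":
    shared = b"\x10\x20\x30\x40"  # bytes both servers receive identically
    # Same synchronized input on both sides: keys match.
    print(derive_session_key(seed_from(shared)) ==
          derive_session_key(seed_from(shared)))
    # Each side mixes in different leftover memory: keys differ.
    print(derive_session_key(seed_from(shared, b"\x00\x01")) ==
          derive_session_key(seed_from(shared, b"\xff\xfe")))
```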


The results show that intercepting and monitoring
system calls introduces only a 1.4% overhead, and
full replication to another machine reduces bandwidth by
only 9.6%.


The OpenSSH tests demonstrated that HotSwap 
really does provide a single system image where
master and backup appear as one computer. We used 
ssh to log in and edit files and scp to upload files. All 
these actions were replicated on both the master and 
backup simultaneously and transparently. 


Replicating PostgreSQL demonstrated that Hot- 
Swap can immediately add transparent failover to a 
database without modifying the database itself. 


Conclusion 


HotSwap has unique properties. It adds transparent
failover and a single system image to servers without
any shared components. Backup servers maintain
identical file-system and internal memory states with 
the master. Client connections are never lost or broken 
on server failure. Servers do not have to be modified, 
with few exceptions. The price of fully transparent 
replication is a small amount of overhead. 


Availability 


HotSwap will be available in server kits that 
include a complete tested server and the minimal root 
file system required to support it. We expect
HotSwap server kits to be available for download 
from www.hotswap.net in October, 2002.


ABSTRACT


As the security threats on the Internet are becoming more prevalent, firewalls and other 
forms of protection are becoming more commonplace. Unfortunately, improperly configured 
firewalls can cause a variety of problems. One particularly nasty problem is when a firewall 
administrator chooses to use — or continue using — Path MTU Discovery (a good choice in most 
situations), but blocks packets required for the protocol to work: ICMP type 3 code 4 packets. This 
problem, the Path MTU Discovery Black Hole, has been discussed many times before. However,
with under-1500 MTU protocols such as PPPoE becoming common for both home and business
high-speed connections, this problem is affecting more people than ever before. 


Introduction 


With the rise of security threats due to hackers, 
script kiddies, and viruses, the use of firewalls is becom- 
ing more widespread. This is a positive trend, but as 
adding firewalls to a network becomes a more common 
task, other problems inevitably arise. Configuring fire- 
walls without proper knowledge of networking proto- 
cols can keep out more than one bargained for. This 
paper describes one common problem caused by apply- 
ing overly strict packet filters incorrectly. Causes are 
examined and solutions are presented and analyzed. 


To Filter or Not To Filter 


Firewalls in their most simple form are IP routers that 
can be told which packets to forward and which pack- 
ets to drop. This task is generally called packet filter- 
ing. Deciding what to filter and what not to filter is the 
hardest part of setting up a firewall. Unless the exact 
makeup of a network and all its IP applications is 
known, trial and error is the only way to find out 
which traffic should be allowed through. Since the 
idea is to increase security, denying everything unless 
specifically allowed is the general policy. Having a 
detailed knowledge of the network you are attempting 
to protect is critical to deploying an effective firewall. 


Internet Control Message Protocol 


Certain applications have proven to be more dan- 
gerous than others. In the early 90’s, most vulnerabili- 
ties were found in programs like sendmail and ftpd. Fil- 
tering access to these programs was not always possible 
since they provided a direct service to end users. 
Instead, an upgrade of the software was needed to 
divert the attention of crackers elsewhere. In 1996, after
the release of Windows 95, another type of problem
surfaced. The ping of death revealed that many operating
system vendors had neglected to check the validity of an
Internet Control Message Protocol (ICMP) echo request
packet. This caused many machines to crash. Later in 
1998, the smurf attack used ICMP echo requests to 
flood a network by pinging a broadcast address. Since 
ICMP does not directly offer a service, filtering out 
ICMP packets seemed like a reasonable option to pre- 
vent these attacks. This completely ignored the function 
of ICMP in the TCP/IP suite. The main purpose of 
ICMP packets is error handling: letting a host know 
when there is a problem in the communication. The 
ICMP echo (ping) function can be used for debugging 
but is in fact far less critical. 


Path MTU Discovery Black Hole 


When two hosts set up a connection over the 
Internet using the TCP protocol, each end may let the 
other know what its maximum segment size (MSS) is. 
This MSS is derived from the maximum transmission unit
(MTU) of the local interface by subtracting 40 bytes
for the TCP and IP headers. If somewhere along the way an
IP packet does not fit in the MTU of the next link, the 
router handling the packet will fragment it. That is, if 
Path MTU Discovery is not used. 
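
The MSS calculation is simple arithmetic. The sketch below assumes
20-byte IP and TCP headers without options, which is where the
40 bytes come from; the function name is illustrative.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20  # TCP header without options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented IP packet."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    print(mss_for_mtu(1500))  # standard Ethernet MTU
```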


Fragmenting packets puts a strain on Internet 
routers, and it also degrades the overall performance 
of a connection. To overcome these problems, Path 
MTU Discovery (PMTUD) was proposed in 1988. It 
is now an Internet standard described in RFC 1191 [1]. 
PMTUD states that when two hosts communicate over 
TCP, the Don’t Fragment (DF) bit is set. This forces a 
router that wants to send a large packet over a link that 
is too small to drop the packet and notify the sending 
host by sending an ICMP type 3 code 4 message. This 
message says the destination is unreachable, because 
your packet is too large and I may not fragment it. To
this standard ICMP message, RFC 1191 adds one more piece
of information: the MTU of the next link is x bytes. This
way the sending host can adjust the MSS for the con- 
nection and re-send the data. 


Since 1988 almost all operating systems have 
adopted the recommendations of RFC 1191 and use 
Path MTU Discovery when communicating via TCP. A 
problem arises when PMTUD is enabled, but incoming 
ICMP type 3 code 4 messages are filtered by a firewall. 
Since the sending host is never properly notified of any 
problem with the size of the packets, it will not adjust 
its MSS. Communication with the other host will fail. 
This is known as the Path MTU Discovery Black Hole 
problem and is described in detail in RFC 2923 [2]. 
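
The behaviour can be modelled in a few lines. The following is a
toy simulation under simplifying assumptions: the path is a list of
made-up link MTUs, every packet carries the DF bit, and a filtered
ICMP message simply never reaches the sender. It is not a model of
any particular TCP stack.

```python
def send_with_pmtud(packet_size, link_mtus, icmp_filtered, max_attempts=5):
    """Return (delivered, final_packet_size) for a DF-flagged packet."""
    size = packet_size
    for _ in range(max_attempts):
        # First link along the path that is too small for the packet.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return True, size
        # The router drops the packet and sends ICMP type 3 code 4,
        # reporting the MTU of the next link.
        if icmp_filtered:
            continue  # sender never hears about it and retries unchanged
        size = bottleneck  # sender shrinks its packets and re-sends
    return False, size


if __name__ == "__main__":
    path = [1500, 1492, 1500]
    print(send_with_pmtud(1500, path, icmp_filtered=False))
    print(send_with_pmtud(1500, path, icmp_filtered=True))
```

With the ICMP message delivered, the sender converges on a size
that fits; with it filtered, every retransmission is dropped at the
same router.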


The problems with PMTUD and ICMP filtering 
date from long before RFC 2923. One example is Path 
MTU Discovery and Filtering ICMP [10] explaining 
the issue as early as January 1998. On mailing lists 
like the North American Network Operators’ Group 
(NANOG) the problem has been discussed extensively,
and questions about it return every few months.
This is because more and more people are being
affected by the black hole. 


Home and Business Networks and the Black Hole 


Links with small MTU sizes are quite rare in the
core of the Internet. This is perhaps why filtering out 
all ICMP packets does not seem to cause immediate 
problems. However, it is causing problems with net- 
works behind newer broadband connections in both 
homes and businesses (older technologies such as 
SLIP and X.25 are also vulnerable). Techniques like 
xDSL and DOCSIS (data over cable TV service) pro- 
vide an Internet connection that is always on. Com- 
bined with the large bandwidth offered by these ser- 
vices, connecting multiple computers to the uplink 
becomes feasible and rewarding. The high bandwidth
also calls for a faster connection than a serial cable
(either RS-232 or USB). Often Ethernet is chosen, with
PPP over Ethernet (PPPoE) as the WAN protocol. PPPoE uses
encapsulation to deliver IP packets destined for the
Internet to the broadband modem via Ethernet.


Figure 1: IP connection affected by the Path MTU Discovery Black Hole. (Diagram labels: Home PC, Home Firewall, ISP broadband router, Corporate Firewall, Corporate Webserver, Ethernet, 1500 bytes (DF), (DROPPED).)

Going back to our Path MTU Discovery Black
Hole problem, the MTU of the PPP interface will have 
to allow for the encapsulation so that the total PPPoE 
packet will fit in the standard Ethernet MTU of 1500. 
PPPoE interfaces therefore have a standard MTU of 
1492. The disaster scenario now becomes clear: 

1. A workstation on the network will start a TCP 
session to, say, a web server on the Internet 

2. The PC sets the MSS to 1460 since the Ethernet 
MTU is 1500 

3. The web server also connects to Ethernet, so it 
replies with an MSS of 1460 

4. The web server enables PMTUD for the traffic 
to the PC 

5. The PC sends an HTTP request (typically a few 
hundred bytes) 

6. The web server starts sending the requested 
file, in 1500-byte IP packets

7. The broadband router at the ISP of the end net- 
work cannot fit the packet into the PPP link and 
sends an ICMP type 3 code 4 message to the 
web server 

8. A firewall between the end network and the 
web server drops the ICMP packet (often this is 
the firewall meant to protect the web server, but 
it can easily be any other firewall or router in 
between the two end networks) 

9. The user is unable to browse the web site 


The example uses web browsing and HTTP, but 
it holds true for any TCP communication sending 
messages of more than 1452 bytes at a time (E-mail, 
ftp, etc.). 
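
The numbers in the scenario can be checked with a short sketch. It
assumes the usual 8 bytes of PPPoE encapsulation (a 6-byte PPPoE
header plus a 2-byte PPP protocol field) and 40 bytes of TCP and IP
headers; treating the effective MSS as the smaller of the two
advertised values is a simplification, and the helper names are
illustrative.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8    # 6-byte PPPoE header + 2-byte PPP protocol field
TCP_IP_HEADERS = 40   # IP and TCP headers without options

PPPOE_MTU = ETHERNET_MTU - PPPOE_OVERHEAD


def negotiated_mss(client_mtu, server_mtu):
    """Effective MSS when each end derives its MSS from its own MTU."""
    return min(client_mtu, server_mtu) - TCP_IP_HEADERS


def fits_pppoe_link(tcp_payload_bytes):
    """True if a segment with this payload fits the PPPoE link unfragmented."""
    return tcp_payload_bytes + TCP_IP_HEADERS <= PPPOE_MTU


if __name__ == "__main__":
    mss = negotiated_mss(ETHERNET_MTU, ETHERNET_MTU)
    print(PPPOE_MTU, mss, fits_pppoe_link(mss))
```

Both ends sit on Ethernet and agree on an MSS of 1460, yet only
payloads of up to 1452 bytes fit through the PPPoE link, so small
requests pass and full-sized responses do not.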


Figure 1 shows a home firewall. This can either 
be a device specially designed for this purpose, or a 
generic workstation configured for this task — and 
could just as well be the firewall of a business. With 
the always-on feature of broadband connectivity, set- 
ting up a firewall at home is becoming a must. 


Who Is (Not) Affected 


A link with a small MTU can exist anywhere on 
the Internet. So theoretically everyone can be affected by 
this problem. As explained earlier, home and business 
networks utilizing PPPoE or similar protocols common 
in today’s DSL and cable networks have a higher chance 
of encountering this problem. This increased probability 
does not apply to the following setups: 

• Just one workstation connected to a modem.
  Since the MTU of the PPP interface is used to
  calculate the MSS for each TCP connection, the
  web server will only send packets that will fit
  into the PPP link.

• Home gateways with a public IP address on an
  Ethernet interface. The external Ethernet interface
  can directly be used for Internet traffic since
  it has its own public IP address. Since no
  encapsulation is needed, the MTU used for Internet
  traffic is 1500 (the default Ethernet MTU).

• Home gateways connecting to a modem using USB.
  Since USB does not have an MTU of its own, the PPP
  connection can safely use an MTU of 1500 or higher.

• Home gateways connecting to a modem using PPTP.
  The Microsoft Point-to-Point Tunneling Protocol (PPTP)
  uses a modified version of Generic Routing
  Encapsulation (GRE). The MTU of the GRE interface is
  set to 1500. Since GRE adds 56 bytes of overhead to
  each packet, it is possible packets will not fit
  into the MTU of Ethernet. In such a case, the
  original IP packet is fragmented even if the
  Don’t Fragment bit is set. This is quite nasty
  and lowers performance [1]. It does however
  prevent the Path MTU Discovery Black Hole
  from occurring on account of the PPTP link.


Cause of the problem 


As mentioned above, the Path MTU Discovery 
Black Hole problem is caused by using PMTUD with- 
out allowing crucial ICMP packets to pass network fil- 
ters. RFC 2923 [2] describes this as an act of over- 
zealous security administrators. It is a sign of the 
times to have very strict firewall policies. Check Point 
Software Technologies is the undisputed market leader 
in firewall solutions. Their FireWall-1 product used to 
ship with the default Policy Properties containing a 
setting to allow all ICMP traffic to pass. When the 
smurf attack hit in 1998, Check Point was publicly 
criticized for allowing ICMP through by default. This 
caused the company to change the default settings to 
disallow all ICMP traffic. Since Path MTU Discovery 
has become a standard TCP/IP feature, when anyone 
now installs an out-of-the-box Check Point firewall,
they introduce the PMTUD Black Hole. It is now left 
to the security administrator to explicitly allow ICMP 
type 3 code 4 packets to the servers that use Path 
MTU Discovery, or turn off PMTUD if they are 
uncomfortable with allowing such ICMP packets into 
their network. 


RFC 2923 [2] mentions in Chapter 3: 


It is vitally important that those who design and 
deploy security systems understand the impact of 
strict filtering on upper-layer protocols. The safest 
web site in the world is worthless if most TCP 
implementations cannot transfer data from it. 


We could not have said it any better. Many authoritative 
sources confirm that allowing ICMP type 3 code 4 pack- 
ets through a firewall does not pose a security risk [3, 9]. 


Size of the Problem 


If all of the above still sounds like a mere academic
problem, here’s the scary part: it is not only less
experienced administrators who are blocking all ICMP
packets while using Path MTU Discovery. Web sites
of organizations with a focus on security also have this 
problem. Just to name a few:

www.securityfocus.com (recently fixed)
www.cert.org
www.verisign.com
www.counterpane.com
www.ntsecurity.com


If you cannot trust such security experts to cor- 
rectly configure a firewall, whom can you trust? Is it 
fair to refer to this behavior as less experienced? 
Could not these administrators be ahead of the game 
by filtering something that may soon become a secu- 
rity hole? In fact, there is a way for administrators to 
successfully block ICMP type 3 code 4 packets from 
entering their network without breaking things. Blocking
these packets without taking the proper precautions,
however, is not acceptable for security professionals
administering firewalls. Proper solutions are
discussed below. 


Solutions 


Since this problem has been around for quite a 
while, different solutions have been developed. Interest- 
ingly enough, even though the problem is caused by 
misconfigurations at the server side, most solutions are 
aimed at modifications of the clients. Apart from the 
moral discussion of this, it makes little sense
implementation-wise. If one popular server is misconfigured, all
users behind a small MTU link wishing to use this 
server will have to adjust their settings. It would be 
much easier if the users could convince the maintainers 
of the broken site to solve the problem at its source. 
The truth is that this is not an easy task. This is why 
solving the issue at the client side is so popular. 


The first three solutions we present depend on 
the cooperation of the (security) administrators of the 
websites with a misconfigured firewall. Only if this 
cannot be achieved, should one look at the things that 
can be done on the client side. 


Allow ICMP Type 3 Code 4 Packets To Reach the 
Servers 


The simplest solution is to allow Path MTU Dis- 
covery to work as it was intended: set the Don’t Frag- 
ment bit on all packets and allow ICMP type 3 code 4 
messages to reach the server. This means changing the 
overly strict rules on firewalls and other active packet 
filters. It should be noted that this is not considered a 
security risk by many authorities [3]. However, if a 
firewall administrator feels that allowing such packets is
more risk than it is worth, there are other solutions. 
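
As an illustration of such a rule, the sketch below models a
default-deny packet filter with one explicit exception for ICMP
type 3 code 4. It is not the syntax of any firewall product; the
Packet type, the allowed web ports, and the allow() function are
assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Packet(NamedTuple):
    """Hypothetical, minimal view of an incoming packet."""
    protocol: str
    icmp_type: Optional[int] = None
    icmp_code: Optional[int] = None
    dst_port: Optional[int] = None


def allow(packet):
    """Default-deny policy for a web server that uses PMTUD."""
    # The service itself (ports chosen for illustration).
    if packet.protocol == "tcp" and packet.dst_port in (80, 443):
        return True
    # Destination unreachable, fragmentation needed and DF set.
    if packet.protocol == "icmp" and (packet.icmp_type, packet.icmp_code) == (3, 4):
        return True
    # Everything else, including ICMP echo requests, is dropped.
    return False


if __name__ == "__main__":
    print(allow(Packet("icmp", icmp_type=3, icmp_code=4)))
    print(allow(Packet("icmp", icmp_type=8, icmp_code=0)))
```

The policy still rejects echo requests and every other ICMP type,
so it keeps the strict posture while letting PMTUD function.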


Disable Path MTU Discovery 


If allowing ICMP into a network is not an option 
or cannot be achieved, the right thing to do is disable 
Path MTU Discovery on all servers that cannot 
receive ICMP type 3 code 4 packets. Since receiving 
these packets is a requirement for PMTUD to work 
[1], it breaks RFC standards and simply makes no 
sense to have PMTUD enabled on these servers. How 
to disable this feature depends on the operating system
of the server. Cisco published a page with settings for
some popular operating systems [4]. It is worth noting 
that disabling PMTUD to solve the PMTUD Black 
Hole will cause fragmentation. PMTUD was intro- 
duced to maximize performance by minimizing frag- 
mentation [1]. Reintroducing fragmentation should be 
considered only if the previous solution is not feasible. 


Path MTU Discovery Black Hole Detection 


§2.1 of RFC 2923 [2] recommends the implemen- 
tation of a PMTUD Black Hole detection mechanism. 
This is done by turning off the DF bit when retransmit- 
ting TCP packets. Various TCP/IP stacks now imple- 
ment this detection scheme, but it is not turned on by 
default. The very nature of this solution (retransmis- 
sions) results in lower performance. Since it requires 
changes on the server side anyway, it makes more sense 
to turn off Path MTU Discovery altogether. 
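The detection scheme described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the list-of-MTUs path model, and the rule that every retransmission clears the DF bit are ours, and a real TCP/IP stack implements this inside the kernel.

```python
def deliver(size, dont_fragment, link_mtus):
    """Return True if a packet of `size` bytes crosses every link."""
    for mtu in link_mtus:
        if size > mtu and dont_fragment:
            # The router drops the packet and sends ICMP type 3 code 4,
            # which we assume a firewall discards (the black hole).
            return False
        # Otherwise the packet fits, or the router fragments it.
    return True


def send_with_detection(size, link_mtus, max_tries=3):
    """Send with DF set first; clear DF on each retransmission."""
    attempts = []
    for attempt in range(max_tries):
        dont_fragment = (attempt == 0)
        ok = deliver(size, dont_fragment, link_mtus)
        attempts.append((attempt, dont_fragment, ok))
        if ok:
            break
    return attempts


# A 1500-byte packet over a path with a hypothetical 1492-byte hop:
# the first try is lost, the retransmission gets through fragmented.
print(send_with_detection(1500, [1500, 1492, 1500]))
```

The lost first transmission is what costs performance: the sender has to wait for a retransmission timeout before the DF bit is cleared.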


Using a Proxy Server 


If a server is suffering from the Path MTU Discovery
Black Hole and it cannot be fixed, there are some things
that can be done on the client side to prevent the Black
Hole from acting up. For web browsing, for example, it is
possible to use a proxy server that does not suffer from
the PMTUD problem. The proxy will then retrieve the pages
on the client’s behalf, repacking them into smaller TCP
packets. Of
course this only solves the problem for protocols that 
can be proxied. 


Lowering MTU/MSS of the Internal Network 


Another option is to lower the MTU of the client 
to the MTU of the smallest link between the client and 
the server. This way, the client will advertise a smaller 
MSS indicating to the server that its packets should 
not exceed this size. The same result can be achieved 
by lowering the maximum MSS value that a host will 
advertise [4]. This solution will not solve all problems. 
While the MTU of the uplink is probably known and
can be used as a guideline for the MTU of the systems 
on the LAN, one cannot be sure that this will always 
be the smallest MTU of the path between the clients 
and a server. If a smaller MTU exists on this path, 
ICMP type 3 code 4 messages will be sent to the 
server and the connection will still fail. Additionally, 
non-TCP protocols like UDP and IPSec will still suf- 
fer from the PMTUD Black Hole. 
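The relation between MTU and advertised MSS, and the reason this workaround is incomplete, can be shown with a little arithmetic. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options; the function names and the example path MTUs are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """MSS a host advertises when its interface MTU is `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_path(client_mtu, path_mtus):
    """True if full-sized segments cross every link unfragmented."""
    packet_size = mss_for_mtu(client_mtu) + IPV4_HEADER + TCP_HEADER
    return packet_size <= min(path_mtus)


print(mss_for_mtu(1500))                       # plain Ethernet
print(mss_for_mtu(1492))                       # e.g., a PPPoE uplink
print(fits_path(1492, [1500, 1492, 1500]))     # uplink is the smallest link
print(fits_path(1492, [1500, 1492, 1400]))     # a smaller hop further on
```

In the last case the server still sends packets that are too large for the 1400-byte hop, so the connection fails exactly as described above.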


MSS Clamping 


The solution of lowering the MTU on all systems
of the LAN sounds feasible when “all” means fewer than
five. If there are a dozen or more systems, this
becomes a rather gruesome task. Several solutions
exist to automatically adjust the MSS of TCP packets
when they are being routed by the internal gateway.
This is a particularly nasty solution. By definition, a
router should not interfere with end-to-end settings
like the MSS. Additionally, some protocols like IPSec
will break when the MSS is changed in midcourse.
There are several implementations of this hack:

• --clamp-mss-to-pmtu switch for IPTables in
  Linux 2.4.x kernels [5]

• CLAMPMSS setting of Roaring Penguin’s
  PPPoE Software [6, 13]

• mssfixup command of ppp for FreeBSD [7]


This solution suffers from the same problems as 
above: there is no guarantee that the uplink MTU is 
the smallest in the path (even if it is, this only works 
for TCP). 
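What a clamping gateway does to the MSS option of a forwarded SYN can be sketched in a few lines. This is a simplified model, not the code of any of the implementations listed above; the function name and the 40-byte header overhead (IPv4 plus TCP, no options) are assumptions.

```python
def clamp_mss(advertised_mss, uplink_mtu, header_overhead=40):
    """Return the MSS a gateway would write into a forwarded SYN."""
    limit = uplink_mtu - header_overhead
    # Only ever lower the value; a smaller advertised MSS is left alone.
    return min(advertised_mss, limit)


print(clamp_mss(1460, 1492))  # Ethernet host behind a 1492-byte uplink
print(clamp_mss(536, 1492))   # already small enough, unchanged
```

The gateway only knows its own uplink MTU, which is why a smaller link further along the path defeats the clamp.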


The MSS Initiative 


In an attempt to shift the focus to the cause of this
issue rather than the effect, we started The MSS Initiative
[8]. The purpose of this initiative is to raise
awareness of systems administrators about the Path 
MTU Discovery Black Hole problem. We believe that 
when enough security administrators realize that 
blocking ICMP type 3 code 4 packets breaks one of 
the core IP protocols, they will adjust the rule sets of 
the devices they manage or turn off PMTUD. Grue- 
some hacks like MSS clamping will then become 
unnecessary. The MSS Initiative maintains a list of 
sites that are currently suffering from the Path MTU 
Discovery Black Hole and attempts to notify the 
administrators of those sites. This works in two ways: 
end users can check the list to see if a site they cannot 
reach is misconfigured, and hopefully administrators 
will take action upon receipt of the notice they receive 
from us. We also offer to help administrators unsure of 
how to fix their setup. 


Determining if a site suffers from the Path MTU 
Discovery Black Hole can be difficult. It is very easy 
to mistake other network problems for this one. Users 
are encouraged to follow the instructions detailed on 
The MSS Initiative website if they believe a site is 
suffering from the PMTUD Black Hole. Users may 
then report the site to the Initiative so it can be added 
to the list and the administrator contacted. 


Conclusion 


Packet filters and firewalls have become necessary
tools to protect systems against the growing hostility
on the Internet. At the same time, these tools themselves,
if not configured properly, pose a threat to
one of the core protocols of the IP suite. In an
ideal world, everyone would follow the guidelines set 
forth by Internet standards and RFCs. In a diverse and 
disjoint society like the Internet this cannot be expected 
to happen. However, when some of these standards are 
violated by a large number of sites and even some 
important vendors and security specialists fail to follow 
them correctly, things do break. It is in the nature of the 
users of the Internet to find a way around the problems 
that arise. Fixing things locally is attractive because of 
the speed and control that can be achieved, but it also 
allows the real problems to persist. 

It is time to make an effort to correct the problem
that has been explained in this paper. If we do not, we 
might have to abandon the usage of Path MTU Dis- 
covery in the near future. This is neither efficient nor 
practical since it is also one of the core protocols of 
IPv6 [11, 12]. 


ABSTRACT


We present an approach that addresses the problem of securing software configurations from 
the security-relevant actions of poorly built/faulty installation packages. Our approach is based on 
a policy-based control of the package manager’s actions and is customizable for site-specific 
policies. We discuss an implementation of this approach in the context of the Linux operating
system for the Red Hat Package Manager (RPM).


Introduction 


Management of software installations has been
one of the biggest problems facing system administrators.1
Significant progress [6] has been made in
some aspects of this problem, e.g., management of 
dependencies and conflicts among packages. In other 
areas that concern overall system security and interoper- 
ability with existing applications, package managers still 
fall short, as they make several unrealistic assumptions: 

• System administrators want to treat packages
  as ‘black boxes.’ Although system administrators
  are not interested in the details of installation,
  they certainly care about package installation
  actions that have implications for overall
  system security and operation, e.g., addition of
  new users, modifications to boot-time scripts,
  addition of entries to crontab, modifications of
  system libraries, changes to global configuration
  files (e.g., /etc/inetd.conf) or application-specific
  configuration files, or, in the case of
  Windows, changes to the system registry.

• Package installation steps operate correctly. Few
  provisions exist for dealing with poorly written
  packages that crash in the middle of the installation
  process. Typically, such crashes result
  from unanticipated conditions (involving the
  configuration of the system on which the package
  is being installed) encountered by install
  scripts that are included with many packages.
  System administrators are all too familiar with
  situations in which packages can neither be fully
  installed nor be fully uninstalled, leaving the
  system in an inconsistent state.

• All software and system configuration updates are
  the result of package installation. In practice,
  however, system administrators frequently have to
  configure systems or packages by manually editing
  config files. In addition, software is frequently
  installed outside of the package management system,
  e.g., by downloading and compiling a
  source-code archive. Package managers interact
  poorly with such situations, often ending up overwriting
  critical application files. Although package
  managers can save backup copies of certain
  config files, the sysadmin is usually not alerted
  that the config files have been updated.


†This research is supported in part by an ONR University Research
Initiative grant N000140110967 and NSF grant CCR-0098154.

1In this paper, we use the term “system administrator” to
refer to professionals and end-users that perform software
installation.


We describe a new approach that augments exist- 
ing package managers such as RPM to overcome the 
above problems. Our approach enables system admin- 
istrators to reason about the security-critical actions of 
an installation/upgrade process, check whether these 
actions are compatible with specified security policies, 
and if so, allow the installation to proceed. During the 
installation, all actions are logged so that they can be 
rolled back in the event of an installation failure.2 We 
have implemented our approach in the form of a tool 
called RPMShield that operates in conjunction with 
RedHat’s package manager (RPM). As compared to 
existing package managers, our approach offers the 
following benefits: 

• Policy-based control of package installation
  actions. While existing package managers such
  as RPM allow a system administrator to examine
  package contents and installation scripts in
  detail, this is a cumbersome process and hence
  seldom undertaken. In contrast, our approach
  presents a convenient interface through which a
  system administrator can exert control over
  installation actions that impact system security
  or the operation of existing applications.

• Interoperability with changes made outside of
  package managers. Our approach provides a
  convenient mechanism to control updates to
  manually edited files, files shared among multiple
  packages, or more generally, files installed
  outside the scope of the package manager.

• Normal-user installation of packages. Individual
  users often want to install packages that are
  of interest to themselves. Since all RPM installation
  actions require super-user privilege, normal
  users are unable to install such packages
  for themselves. Our approach can support this
  capability through the use of security policies
  that limit installation actions so that the changes
  are restricted to a specific user directory.

• Tolerance to failures. Package managers offer
  poor support for reverting to the original system
  configuration when an installation/upgrade process fails.
  Our automatic recovery mechanism reverts the
  system to its original (consistent) state.


2Note that a complete rollback is impossible if the installation
scripts communicate over the network, or when processes
unrelated to the installation are allowed to make system
changes after reading files modified by the installation process.


An Approach for Secure Software Installation


Figure 1: Approach overview.


signatures [7], sandboxing [9], proof-carrying code [11]
and model-carrying code [12]. 


Overview of Approach 


Our approach, presented in Figure 1, divides
package installation into two phases: 1) the pre-installation
phase, in which a package is analyzed to determine
its compatibility with a system administrator’s policies,
and 2) the installation phase, in which the actual
package installation takes place in a controlled environment.
Each of these phases is described below.


Pre-installation Phase 


The pre-installation phase (see Figure 2) consists
of the following steps:

• Generation of behavioral models. In this step,
  a package is analyzed to identify the security-relevant
  actions that will take place during its
  installation. The analysis has two parts: 1)
  finding the list of files the package will
  install/upgrade, and 2) capturing the intended
  behavior of its (pre-install and post-install)
  scripts. The first part involves querying the
  package itself as well as the package database.
  The second part involves learning the behavior
  of the scripts/make files. More details on the
  model generation process are presented in the
  section ‘Model Generation.’

• Consistency resolution. The model generated in
  the previous step is supplied to the consistency
  resolver, which checks whether this behavior is
  in accordance with the security policy provided
  by the system administrator. A discussion of
  security policies that could be enforced through
  the system is presented in the ‘Security Policies’
  section. Consistency resolution is described in its
  own section.
By performing consistency resolution before installation,
our approach avoids the time-consuming step of
actual installation when there is a conflict. In addition,
all conflicts with the policy are identified and presented
together, which enables a system administrator to make
more informed decisions. This contrasts with conflict
identification during actual installation, when each conflict
must be presented individually to the system
administrator for acceptance. This cumbersome process
can lead to “click fatigue,” sometimes causing the system
administrator to make inappropriate decisions.


Figure 2: Pre-installation phase.


Installation Phase 


While the benefits of pre-installation checks 
were identified above, it is not always possible to 
identify all conflicts statically. Scripts may perform 
complex computations (e.g., creating a file name from 
several command-line arguments or environment vari- 
ables) whose results cannot always be statically deter- 
mined. Such conflicts are dealt with during the second 
phase of package installation (refer to Figure 2), 
namely, the installation phase. 


In this phase, as shown in Figure 3, the package 
manager is allowed to run in an environment where 
the system calls made by the package manager process 
and its children are monitored by RPMShield. These 
system calls are compared with the policy provided by 
the system administrator during the pre-installation 
phase. Typically there are no policy violations in this 
phase, as they would have been handled in the previ- 
ous phase. However, if conflicts do arise, this informa- 
tion is presented to the system administrator. If the 
violation is accepted, then package installation pro- 
ceeds. If not, installation is aborted, and the system 
state is restored as it was prior to the installation. The 
runtime-checking mechanism is described in the ‘Run- 
time Interception’ section. 


Description 


This section elaborates on the various compo- 
nents that were introduced in the preceding high-level 
discussion. 

Security Policies 


We use an expressive policy language that, in
addition to capturing conventional access-control
policies, can also express context-sensitive policies
such as “this application cannot modify files owned
by other applications,” or history-sensitive policies
such as “the installation process can only delete the
files it has created.” Access-control policies and
context-sensitive policies are specified conveniently
through a GUI, as shown in Figure 4. In this figure,
the system administrator can specify whether the
package installation process may possess the corresponding
capabilities. The first row describes a capability
where a package can create files in directories
not owned by itself; the second row shows the policy
specification for writes to files that are owned by the
package but have been modified since the original
installation (e.g., config files); and the succeeding row
covers writes to files owned by other packages.


Similar capabilities that could be specified using
the GUI include the ability to add users, perform network
operations, update shared libraries, modify system
services (e.g., files in the /etc/rc.d/* directories),
execute arbitrary system commands, and so on.
History-sensitive policies are currently not expressed
through the GUI, but could be specified using our
underlying expressive policy language [13].

These security policies are internally represented
as extended finite state machines. A finite state
machine consists of states and transitions. The states
of these machines correspond to various program
points, and the transitions are over an alphabet of system
calls with their arguments. Condition guards are
associated with transitions. Whenever a condition
guard is enabled, an optional action associated
with the guard is triggered. A simple example of an
extended finite state automaton is presented in Figure
5, which illustrates the use of these automata in keeping
track of the number of bytes written to a file
(denoted by MY_FILE).

In this figure, whenever an open system call is
made, the condition guard associated with this transition 
checks whether the file opened is MY_FILE, and if so, 
the file descriptor is stored in the variable X. Later in the 
program, if a write operation is performed, the condition 
guard associated with this transition checks whether the 
file descriptor equals the descriptor stored in X, and if 
so, increments the variable count. (For simplicity, we do 
not show the states and transitions corresponding to the 
invocations of the close system call). The policy repre- 
sented thus is a simple example of a history sensitive 
policy, and the variables that are associated with the 
transitions enable such policy specifications. 
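The automaton of Figure 5 can be mimicked with a small class that keeps the two state variables X and count and applies the guards to a stream of events. This is a hand-written sketch, not the output of the policy compiler of [13]; the event tuples, the class name, and the choice to add the length of each write to count (so that the total reflects bytes written) are assumptions made for illustration.

```python
class ByteCountAutomaton:
    """Tracks writes to one file via open/write events."""

    def __init__(self, my_file):
        self.my_file = my_file
        self.x = None      # state variable X: descriptor of MY_FILE
        self.count = 0     # state variable count

    def on_open(self, path, fd):
        # Guard: FILE == MY_FILE  ->  action: X = fd
        if path == self.my_file:
            self.x = fd

    def on_write(self, des, nbytes):
        # Guard: des == X  ->  action: update count
        if self.x is not None and des == self.x:
            self.count += nbytes

    def feed(self, trace):
        """Run the automaton over (name, arg1, arg2) events."""
        for name, first, second in trace:
            if name == "open":
                self.on_open(first, second)
            elif name == "write":
                self.on_write(first, second)
        return self.count


# Hypothetical trace: two files opened, writes to both.
trace = [
    ("open", "/tmp/other", 3),
    ("open", "/var/log/my_file", 4),
    ("write", 3, 100),   # different descriptor, guard fails
    ("write", 4, 10),
    ("write", 4, 32),
]
print(ByteCountAutomaton("/var/log/my_file").feed(trace))
```

Because the guard on write refers to a value stored by an earlier transition, the outcome depends on the history of the trace, which is what makes this a history-sensitive policy.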


For more information on our work in compiling 
high level specifications into extended finite-state 
machines, we refer the reader to [13]. 
Model Generation


Model generation involves determining the secu- 
rity relevant actions of the scripts and obtaining the list 
of files the package plans to modify/upgrade. The latter 
is obtained directly by querying the package, so we 
describe the analysis of scripts in the following section. 


Scripts 


There are several approaches that address the
problem of analyzing a shell script to determine its
behavior. A static analysis based approach is one
that analyzes the actions of a script without
executing it. It usually involves parsing the input
script and interpreting the script’s actions to analyze its
behavior. Such an approach could build on modifying
the shell interpreter and performing operations in an
abstract domain (a general technique called abstract
interpretation [8]).


Figure 4: Screen shot of RPMShield.


[Figure 5 transition labels: FILE == $MY_FILE → {X = fd};
write(des, buf, count): des == X → {count++}]

Figure 5: An extended finite state automaton.


2002 LISA XVI - November 3-8, 2002 — Philadelphia, PA

Venkatakrishnan, Sekar, Kamat, Tsipa & Liang

The main advantage of a static analysis based 
approach is that it has the ability to reason about the 
actions of the program at a level closer to the program 
source. However, there are a few disadvantages with 
such an approach. Shell scripts heavily depend on the 
environment and redirection for their successful com- 
pletion. Any static analysis based approach has to 
approximate the environment and the effects of redi- 
rection. Thus, the behavior obtained through analysis 
is usually incomplete. 


In addition, there is a practical implementation
problem: shell interpreters abound on a
general-purpose system.
An analyzer that has to deal with an arbitrary script
needs a front end that supports the
idiosyncrasies of the syntax of a number of languages
(bash, csh, and perl, to name a few). Clearly, this is not a
desirable situation.


We follow an alternative approach based on
program behavior learning. In this
approach, we intercept the system calls of the shell
script. We inspect the system call and its arguments, 
and allow it based on whether it is trying to perform a 
security-critical operation. We allow access to all 
operations that are not security-critical: for example, 
reads to non-sensitive files, creation of temporary 
files, execution of commands that do not alter the sys- 
tem state and so on. When the program performs 
write-operations or security critical operations like 
adding a system service or a user, we simply fake the 
return value of the system call. 


By faking, we mean returning success without 
executing the corresponding system call. Thus, the 
original operation is not performed, and the trace that 
is generated is the model of the program capturing the 
intended behavior of the program. All the environment 
related information is available to this approach 
(unlike the previous approach). See the ‘Examples’ 
section for an example of a model that is generated. 
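The faking idea can be sketched as a replay loop over intercepted calls. This is a simulation only: the real tool intercepts system calls with ptrace, whereas here the calls arrive as a list of tuples, and the classification rules, the command list, and the function names are hypothetical.

```python
TEMP_PREFIX = "/tmp/"  # assumed scratch area where writes are harmless
SENSITIVE_COMMANDS = {"/usr/sbin/useradd", "/sbin/chkconfig"}  # assumed


def is_security_critical(call, args):
    """Decide whether a call must be faked instead of executed."""
    if call == "open":
        path, mode = args
        return mode != "r" and not path.startswith(TEMP_PREFIX)
    if call == "execve":
        return args[0] in SENSITIVE_COMMANDS
    return False


def generate_model(events, execute):
    """Run harmless calls, fake the rest, and return the trace."""
    model = []
    for call, args in events:
        faked = is_security_critical(call, args)
        if not faked:
            execute(call, args)  # harmless: let it really happen
        # A faked call reports success to the script without running.
        model.append((call, args, faked))
    return model


events = [
    ("open", ("/etc/mime.types", "r")),
    ("open", ("/tmp/mimetypes.abc", "w")),
    ("open", ("/etc/mime.types", "w")),
    ("execve", ("/usr/sbin/useradd",)),
]
executed = []
model = generate_model(events, lambda c, a: executed.append((c, a)))
for entry in model:
    print(entry)
```

The returned list is the model: it records every intended action, while only the first two calls in this example were actually carried out.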


Consistency Resolution 


The consistency resolution step involves checking
the model generated in the previous
step against the security policies of interest. This step
involves two operations. The first operation
checks whether the scripts in the package conform
to the security policy. The second operation
checks whether the file creation/update operations
of the package are in accordance with the policy.


Script Checking 


Checking whether the execution of a script will
violate the given security policy is an interesting problem.
The model generated for the shell script contains
the trace of the script execution. The security policy
(discussed in its own section earlier) is represented in
the form of an extended finite state automaton that
describes the traces the policy proscribes. The model,
obtained as an output of the previous step, is presented
as an input string to this automaton. If the trace
is accepted by this automaton,
then we can conclude that the execution of this
script will violate the specified policy.


As an example, consider an installation script 
that adds a new user to the /etc/passwd file, while the 
policy allows no such addition. This conflict informa- 
tion (including the specific action and/or file that 
caused the conflict) is presented to the system admin- 
istrator, who may decide to abort the installation, or 
refine the policy to eliminate the conflict, e.g., permit 
addition of user to the password file. 
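A minimal sketch of this check, assuming the trace format of (call, arguments) pairs and a single rule that proscribes write-mode opens of a protected file. The function name and the two-state automaton are illustrative simplifications of the extended finite state automata described earlier.

```python
def check_script(trace, protected_path):
    """Feed a trace to an automaton that accepts proscribed traces.

    Returns (violates, conflicts), where conflicts lists every
    offending action so that all of them can be shown together.
    """
    state = "ok"
    conflicts = []
    for call, args in trace:
        if call == "open" and args[0] == protected_path and args[1] in ("w", "a"):
            state = "violation"  # accepting state: the trace is proscribed
            conflicts.append((call, args))
    return state == "violation", conflicts


# Hypothetical trace of a script that appends a user entry.
trace = [
    ("open", ("/etc/passwd", "r")),
    ("open", ("/etc/passwd", "a")),
]
print(check_script(trace, "/etc/passwd"))
print(check_script(trace[:1], "/etc/passwd"))
```

Refining the policy to permit the addition would correspond to dropping this rule, after which the same trace is no longer accepted by the automaton.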


Checking of File Updates 


The consistency resolver detects any conflicts 
between the policy and the package behavior model, 
e.g., if the policy allows updates only to those files 
owned by a package, then a conflict will arise if the 
package updates a file that has been updated manually 
or by a different package. Similar violations are 
reported for creating files in directories that are not 
owned by the package, deletion of files and so on. 
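The ownership rule in the example above can be sketched as follows. The package database is modeled as a plain dictionary and the set of manually modified files is given explicitly; both, along with the function name and the sample package names, are assumptions for illustration and do not reflect the RPM database interface.

```python
def check_file_updates(package, files, owner_db, modified):
    """Return all conflicts for a policy that allows updates only
    to unmodified files owned by `package`."""
    conflicts = []
    for path in files:
        owner = owner_db.get(path)  # None: file not in the database
        if owner is None:
            continue  # new file; directory ownership is not modeled here
        if owner != package:
            conflicts.append((path, "owned by " + owner))
        elif path in modified:
            conflicts.append((path, "modified since installation"))
    return conflicts


# Hypothetical database contents.
owner_db = {
    "/usr/sbin/httpd": "apache",
    "/etc/httpd/conf/httpd.conf": "apache",
    "/etc/mime.types": "mailcap",
}
modified = {"/etc/httpd/conf/httpd.conf"}
files = [
    "/usr/sbin/httpd",
    "/etc/httpd/conf/httpd.conf",
    "/etc/mime.types",
    "/var/www/html/index.html",
]
for conflict in check_file_updates("apache", files, owner_db, modified):
    print(conflict)
```

Collecting the conflicts in one list, rather than stopping at the first, matches the goal of presenting them to the administrator together.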


Runtime Interception


The installation phase is realized by the follow- 
ing components. 

• System call interception environment. System
  calls and arguments are forwarded to the policy
  enforcement engine using a ptrace-based system
  call interception facility that we had developed
  earlier for Linux [10]. This infrastructure provides
  the facilities for one program to inspect
  another (target) program whenever the target program
  performs system calls; check the arguments
  to the system call; and, if necessary, fake the
  return value without executing the system call.

• Policy enforcement engine. The policy enforcement
  engine is implemented as an extended
  finite state machine, as discussed earlier.
  These automata enforce the policies with
  very low overheads (typically under 2%) [13].

  In addition, the policy enforcement engine
  incorporates rollback logic, which keeps track
  of the files modified by the package manager
  (since the original installation), as well as the
  original contents of these files. If the installation
  is to be aborted, then this information is
  used to reset the system state to what it was
  before the package installation began.
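The rollback logic can be approximated by saving a copy of each file just before its first modification. The class below is a minimal sketch under that assumption; its name and methods are hypothetical, it handles regular files only, and it does not cover the cases noted earlier in which a complete rollback is impossible.

```python
import os
import shutil
import tempfile


class RollbackLog:
    """Remembers original file contents so changes can be undone."""

    def __init__(self):
        self.backup_dir = tempfile.mkdtemp(prefix="rollback-")
        self.saved = {}  # path -> backup copy, or None if file was absent

    def before_modify(self, path):
        """Call before the first write, creation, or deletion of path."""
        if path in self.saved:
            return  # the original state is already recorded
        if os.path.exists(path):
            backup = os.path.join(self.backup_dir, str(len(self.saved)))
            shutil.copy2(path, backup)
            self.saved[path] = backup
        else:
            self.saved[path] = None

    def rollback(self):
        """Restore every recorded file to its original state."""
        for path, backup in self.saved.items():
            if backup is None:
                if os.path.exists(path):
                    os.remove(path)  # file was created by the installation
            else:
                shutil.copy2(backup, path)
        shutil.rmtree(self.backup_dir)
        self.saved.clear()
```

In the interception setting, `before_modify` would be invoked when a write-mode open, unlink, or rename is seen, before the call is allowed to proceed.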


Other Features 


Our tool has a convenient user interface and 
incorporates attractive usability features, which we 
describe below. 


Query downloads. Most packages have depen- 
dencies, i.e., they can be installed only if certain other 
packages are present in the system. A novel feature in 
RPMShield facilitates easy installation of such packages
by downloading them from RPM mirror servers
(such as rpmfind.net). The user can configure this set 
of servers. When an installation encounters depen- 
dency requirements, our implementation searches for 
the existence of these dependency packages on any of 
the servers. 


If the dependency package is found, it is reported
to the user and downloaded at his/her discretion. Of
course, the downloaded dependency is subjected to the
same security checks. Finally, after a successful download,
the entire installation process is resumed.
RPMShield also takes care of transitive dependencies,
e.g., if a package A is dependent on package B, which
in turn is dependent on package C, and so on, then
both packages B and C (and further dependencies) are
downloaded and installed first, and finally package A
is installed.


The implementation of this feature involves the
construction of a directed graph in which the nodes represent
packages and the edges represent the dependencies
between such packages. The installation then
starts by installing from the nodes farthest from the
root and then proceeds back to the root.
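One common way to obtain such an order is a depth-first, post-order walk of the dependency graph. The sketch below assumes the graph is given as a dictionary from package name to its direct dependencies; the function name and the cycle check are our additions, not a description of the RPMShield code.

```python
def install_order(root, depends_on):
    """Return packages in an order where dependencies come first."""
    order = []
    done = set()
    in_progress = set()

    def visit(pkg):
        if pkg in done:
            return
        if pkg in in_progress:
            raise ValueError("dependency cycle at " + pkg)
        in_progress.add(pkg)
        for dep in depends_on.get(pkg, []):
            visit(dep)  # install what this package needs first
        in_progress.discard(pkg)
        done.add(pkg)
        order.append(pkg)

    visit(root)
    return order


# The A -> B -> C example from the text.
print(install_order("A", {"A": ["B"], "B": ["C"]}))
```

Each package appears once in the result, even when several packages depend on it.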


Normal User Installation. Since RPM installa- 
tions require root privileges, normal users (who do not 
have root privileges) do not have an opportunity to 
install packages that are of interest to themselves. 
Using a highly constrained security policy, we can 
provide a confined environment where the package 
can be installed by the normal user. However, not all 
packages can be installed by a normal user. The pack- 
ages have to be re-locatable, and must not update any 
system-owned files. Due to the nature of these constrained
policies, it is not possible to change them
through the graphical user interface. (One could, however,
change them by modifying the policies in the
underlying language.)


/sbin/chkconfig --add httpd 
Examples


We illustrate our approach, and the various stages
of the installation process, by running through
the installation of the Apache web server
with our tool.


Apache is a popular and freely-available Web 
server. The package consists of a set of installation 
files as well as pre-install and post-install scripts. This
example pertains to version 1.3.20-16 of the Apache
server.


The system first queries whether the package sat- 
isfies all dependency checks. If there are any depen- 
dencies on packages that are not yet installed on the 
system, then the user is asked whether to download
these packages from the download sites. The downloaded
packages are run through the same security checks as the
original package. In the following discussion we assume that all
dependencies are satisfied.


The pre-installation script is given in Listing 1. 
The script creates a user in the system. Obviously, this
is a sensitive operation, and the system administrator
is alerted to this operation. (We do not show the model
for this script, but present the model for the post-install
script.)





## Add the "apache" user 
/usr/sbin/useradd -c "Apache" -u 48 \ 
-s /bin/false -r -d /var/www apache \ 
2> /dev/null || : 


Listing 1: Pre-installation script. 





Listing 2 is the post-installation script of the Apache
server. The generated model is shown in Figure 6.
In this model the states refer to various program 
points, and the transitions refer to various system calls 
with their arguments. Due to constraints on the size of 
the figure, we omit some details such as system call 
arguments for a selected set of system calls. 


## safely add .htm to mime types if it is not already there
[ -f /etc/mime.types ] || exit 0
TEMPTYPES=`/bin/mktemp /tmp/mimetypes.XXXXXX`
[ -z "$TEMPTYPES" ] && {
  echo "could not make temporary file, htm not added to /etc/mime.types" >&2
  exit 1
}
( grep -v "^text/html" /etc/mime.types
  types=$(grep "^text/html" /etc/mime.types | cut -f2-)
  echo -en "text/html\t\t\t"
  for val in $types ; do
      if [ "$val" = "htm" ] ; then
          continue
      fi
      echo -n "$val "
  done
  echo "htm"
) > $TEMPTYPES
cat $TEMPTYPES > /etc/mime.types && /bin/rm -f $TEMPTYPES
Listing 2: Post-installation script of apache server. 
This script performs an update to the system services
scripts. The script also updates the file
/etc/mime.types, which is shared by other applications. The
user is alerted to these operations. The other operations
performed by the script, such as creating and
deleting temporary files and running other utilities such
as cut, grep, etc., are allowed by the security policy.


In addition, if this installation of Apache
is actually an upgrade from a previous version, then
the system keeps track of all the files that have been
modified since the previous installation. This not only
includes all configuration files, but other files such as 
local symbolic links, text files, etc. An attempt to 
delete/overwrite these files is presented to the system 
administrator. All through the installation, such files 
are backed up, such that in case the user decides to 
abort the installation, the system can be reverted to its 
original state. 


Applicability to Other Systems 


Although the approach presented in this paper is 
described only in the context of the Linux operating 
system and the RedHat package manager, the overall 
architecture can be ported to other Unix-like environments
with relative ease. We discuss migration
to other package managers below.


We have used the Java environment for the 
implementation of the graphical user interface. Hence 
this portion of the implementation is portable. The 
model generation involves querying the package and 
hence this has to be customized for the particular 
package format. The consistency checking involves 
querying the package database and this requires cus- 
tomization as per the package manager’s programming 
interface for querying and modifying its database. The 
model generation for scripts and the runtime checking 
steps involve system call tracing. Although the ptrace
system call is not completely portable across different 
Unix environments, similar facilities exist for other 
Unix variants, as evidenced by the implementation of 
gdb-like debuggers for these systems. In fact, our sys- 
tem call interceptor [10] provides a uniform program- 
ming interface implementation that abstracts the archi- 
tecture dependencies. Our interceptor has been imple- 
mented for Linux and Solaris. 


Other popular package managers like pkg [3] 
(used on Solaris systems), Ipp [2] (used on AIX 


execve(‘chkconfig’) 
execve (‘mktemp’ ) 
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systems), dpkg [1] (used on Debian Linux distribu- 
tions) have interfaces similar to that of RPM and 
hence our approach is applicable to these package 
managers with the corresponding implementation 
changes that were described above. There are a few 
other package managers like SEPP [4], SLP [5] (used 
on Stampede Linux) that simply do not offer conve- 
nient interfaces to query packages and the package 
databases, and hence are not particularly suitable for 
our approach. 
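
The customization described above can be pictured as a small table of per-format query commands. The following is only a sketch under stated assumptions, not the implementation used in this work: the `PackageBackend` structure and the `list_files_command` helper are hypothetical names, and only two formats are shown. The two command lines themselves (`rpm -qpl` and `dpkg-deb --contents`) are standard ways to list the files in a package file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PackageBackend:
    """How to query one package format (hypothetical structure)."""
    name: str
    # Builds the argument vector that lists the files inside a package file.
    list_files_cmd: Callable[[str], List[str]]


# Only two formats are sketched; other package managers would need
# their own entries and their own output parsers.
BACKENDS: Dict[str, PackageBackend] = {
    "rpm": PackageBackend("rpm", lambda path: ["rpm", "-qpl", path]),
    "deb": PackageBackend("deb", lambda path: ["dpkg-deb", "--contents", path]),
}


def list_files_command(package_path: str) -> List[str]:
    """Pick a backend from the file extension and return the query command."""
    ext = package_path.rsplit(".", 1)[-1].lower()
    try:
        return BACKENDS[ext].list_files_cmd(package_path)
    except KeyError:
        raise ValueError(f"no backend for package format: {ext!r}")
```

The two tools print their listings in different formats, so each backend would also need its own parser before the result could feed model generation.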


Conclusion 


In this paper, we have discussed the design of a
secure software installation framework. Our framework
allows a system administrator to control the
installation process through a configurable set of
policies and to enforce these policies through static and
runtime checks. Our future work includes porting
the prototype to other operating systems and package
managers.
Figure 6: Model of apache server’s post-install script.
ABSTRACT


The Internet is changing computing more than ever before. As the possibilities and the 
scopes are limitless, so too are the risks and chances of malicious intrusions. Due to the increased 
connectivity and the vast spectrum of financial possibilities, more and more systems are subject to 
attack by intruders. One of the commonly used methods for intrusion detection is based on
anomaly. Network-based attacks may occur at various levels, from the application level to the link level, so
the number of potential attackers or intruders is extremely large, and it is almost impossible
to “profile” entities and detect intrusions based on anomalies in host-based profiles. Based on
meta-information, alerts that belong to the same logical network are grouped together
to give a clearer and broader view of the perpetrators. To reduce the effect of probably
insignificant alerts, a threshold technique is used.


Introduction 


Intrusion detection today covers a wider scope
than the name suggests. Intrusion detection systems
(IDSs) are tasked with detecting reconnaissance,
break-ins, disruption of services, and attempts at any
of these activities. IDSs are also expected to identify
the perpetrator or provide useful clues toward that end.
Some IDSs adopt defensive actions when faced with a
(potential) attack. Classical host-based intrusion
detection [2] employs anomaly-based detection
techniques. An anomaly is detected by comparing the
profile or behavior of entities with their normal profiles.
A profile is the pattern of actions of subjects on objects.
The entity space should be small enough to enable
profiling of entities or groups of entities. The entity
itself might be the perpetrator (or might be compromised
by the perpetrator).


Network-based attacks, which generally precede
host break-ins, do not lend themselves to easy profiling.
An attack may occur at the link level, network
level, transport level, or application level. Thus
the entities that are potential attackers or intruders are
not just users with user-ids but link-level entities
(represented by MAC addresses), network-level entities
(represented by network addresses), transport-level
entities (represented by network address and transport
protocol), or network-application-level entities
(represented by network address, transport protocol, and
port address). This leads to an explosion in the entity
space, making it all but impossible to “profile” entities
and detect intrusions based on anomalies with respect
to the profiles. Effectively, all communication (packets
or trains of packets) needs to be examined to
detect traces of (attempted) mischief.


If (potential) mischief is detected, identifying the 
perpetrator is a hard task. The perpetrator is generally 
not directly related to the packet or datagram which is 
the only clue to the offender in the network context. 
The source address in the relevant IP datagram may
vary over time for the same attack, due either to DHCP
assignment or to deliberate maneuvers by the
attacker. In one special case the source address may be
spoofed altogether.


In a similar manner, identifying the target of the
attack may be difficult: the destinations may range
over several ports of several hosts and over several
networks. The target may be an application, a host, a
network, or the networks of an organization, region, or country.
The perpetrator may be an application, a host, a
network, or even networks from a region or a country.


Added to the inherent difficulty in identifying the
perpetrator and/or target is the fact that the rules or
signatures employed to detect (potential) mischief
are simplistic. The presence of a signature does not
necessarily signify mischief. Moreover, the open Internet
contains deliberate mischief makers making a concerted
effort to break in, unwary users who (probably)
unintentionally fire off a scan or mischievous program,
and malfunctioning programs that send off
suspicious-looking packets. The end result is a profusion of
alerts from intrusion detection systems. When looked at
in isolation, the alerts make little sense and serve as little
more than log messages destined for posterity.


To provide greater visibility of (potential)
attacks, perpetrators, and targets, we have devised a
model that aggregates entities and actions into logical
super-entities and actions. Thus a set of host IP
addresses is grouped into a logical network, and
attacks from this logical network constitute a larger
attack. To reduce the noise of (probably) irrelevant
alerts that hinder the identification of
actual offenses and offenders, a thresholding
technique is used. Experimental results show the effect of
our model, which is based on logical groupings and
threshold techniques.


In the next section, we discuss the
background and related work. We then present
our proposed model of network-based intrusion
detection, followed by the observation
environment of our case study and its evaluation. We
then conclude our work.


Background 


It is very important that the security mechanisms
of a system be designed to prevent unauthorized
access to system resources and data. The conventional
approach to securing a computer or network system is to
build a protective shield around it. External users must
identify themselves to enter the system. This shield
should prevent leakage of information from the
protected domain to the outside world. But it is not
possible to design a system that is completely secure. We
can, however, try to detect intrusion attempts so
that action may be taken to repair the damage and to
identify the perpetrators and their victims.
If there are attacks on a system, we would like to detect
them as soon as possible, preferably in real time, and
take preventive measures. This is essentially what an
Intrusion Detection System (IDS) does.


Techniques of intrusion detection can be divided
into two main types: anomaly-based detection and
signature-based detection. Anomaly detection
techniques assume that all intrusive activities are
necessarily anomalous. This means that if we could establish a
normal activity profile for a system, we could, in theory at
least, flag all system states deviating from the established
profile by statistically significant amounts as
intrusion attempts. But two kinds of error are possible: (i)
anomalous activities that are not intrusive are flagged
as intrusive (false positive), and (ii) activities
that are actually intrusive are not flagged (false
negative). The second is obviously more dangerous.


The main issue in anomaly detection systems
thus becomes the selection of threshold levels so that
neither of the above two problems is unreasonably
magnified. The concept behind signature detection
schemes is that there are ways to represent attacks in
the form of a pattern or a signature so that even
variations of the same attack can be detected. In some
sense, they are like virus detection systems: able to
detect known attack patterns but of little use against
unknown attack methods. The main issues in
signature detection systems are how to write a signature
that encompasses all possible variations of the
pertinent attack, and how to write signatures that do not
also match non-intrusive activity. An interesting
difference between these two schemes is that anomaly
detection systems try to detect the complement of
“bad” behavior, whereas signature detection systems
try to recognize known “bad” behavior.


There are advantages and disadvantages for both
of the detection approaches.
• Anomaly detection:
  • Advantages: There is no need to configure the
    system. It automatically learns the behavior of
    a large number of subjects and can be left
    to run unattended. It has the possibility of
    catching novel intrusions as well as
    variations of known intrusions.
  • Disadvantages: It flags only unusual
    behavior, which is not necessarily illicit. This
    poses a problem when the two types of
    behavior (unusual and illicit) do not overlap.
    The system will not find anything wrong with a
    user who changes his behavior slowly before
    an attack. Updating subjects’ profiles and
    correlating current behavior with those
    profiles is typically a computationally
    intensive task that can be too heavy
    for the available resources.
• Signature detection:
  • Advantages: The system knows for a fact
    which behavior is suspicious and which is
    not. This allows simple and efficient
    processing of the audit data. The rate of false
    positives can also be kept low.
  • Disadvantages: Specifying the detection
    signatures is a time-consuming task that
    requires highly qualified staff. It is not
    something that “ordinary” operators of the
    system would do. Depending on how the
    signatures are specified, subtle variations of the
    intrusion scenarios can go undetected.


Early work on intrusion detection was due to
Anderson [1] and Denning [2]. Since then, it has
become a very active field. Most intrusion detection
systems (IDSs) are based on one of two methodologies:
either they generate a model of a program’s or
system’s behavior from observing its behavior on known
inputs [9], or they require the generation of a rule base
[8]. A detailed discussion of network intrusion
detection can be found in [6, 7].


The move from host-based to network-based intrusion
detection correlates with the shift from single
multi-user systems to networks of workstations. As computers
and networks get faster, we can process more audit data
per unit time, but the same computers and networks
unfortunately produce audit data at a much higher rate as
well. Hence the ratio of consumed resources to
available resources is, if not constant, at least not
decreasing fast enough for the performance
of the intrusion detection system to become a
non-issue. The amount of data that needs to be processed
remains a vital problem for intrusion detection, so it
is much more difficult to detect network-based
attacks than host-based attacks. There is still a lack of
study of coverage, that is, of the intrusions the
system can realistically be thought to handle. The problems
are both that of incorrectly classifying benign activity
as intrusive (a false positive) and that of
classifying intrusive activity as non-intrusive (a false
negative). These misclassifications lead to different problems.
We try to address both issues in this paper.


Network-based Intrusion Detection Model 


To provide greater visibility of (potential) attacks,
perpetrators, and targets, we have devised a model that
aggregates entities and actions into logical super-entities
and actions. Thus a set of host IP addresses is grouped
into a logical network, and attacks from this logical
network constitute a larger attack. To reduce the noise of
(probably) irrelevant alerts that hinder the
identification of actual offenses and offenders, a
thresholding technique is used.


The logical groupings are carried out based on 
what we call meta-information or “glue” information. 
This meta-information is in effect a pool of “hints” 
which indicate how the pieces of a puzzle posed by 
the alerts (may) fit together to form a larger picture. Its
contents are network topological information,
organizational network information, Autonomous System
(AS) information, routing information, Domain Name
System information, and geopolitical information. Most of
these components are readily available in the network.


The threshold is a tunable parameter. It can be 
varied to provide the best visibility. 
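
The two steps of the model, logical grouping and thresholding, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the case study: the prefix-to-AS table, the AS labels, and the function names are all hypothetical, and a real deployment would populate the table from a routing registry.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable

# Hypothetical meta-information: network prefix -> logical group (an AS label).
# The prefixes are documentation ranges, used here purely for illustration.
PREFIX_TO_AS = [
    (ipaddress.ip_network("192.0.2.0/24"), "AS64500"),
    (ipaddress.ip_network("198.51.100.0/24"), "AS64501"),
]


def logical_group(ip: str) -> str:
    """Map a host address to its logical network, if one is known."""
    addr = ipaddress.ip_address(ip)
    for network, group in PREFIX_TO_AS:
        if addr in network:
            return group
    return ip  # no meta-information: the address remains its own group


def aggregate(alert_addresses: Iterable[str]) -> Counter:
    """Count alerts per logical group instead of per host address."""
    return Counter(logical_group(ip) for ip in alert_addresses)


def apply_threshold(groups: Counter, threshold: int) -> Dict[str, int]:
    """Keep only the groups whose alert count reaches the threshold."""
    return {group: count for group, count in groups.items() if count >= threshold}
```

With three alerts from different hosts in 192.0.2.0/24, a threshold of 2 would discard every per-address profile but keep the single aggregated profile for that network, which is the effect the model relies on.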


The larger picture 


The isolated incidents reported as alerts when 
seen in the context of the meta-information form a 
clearer picture. The application, server and/or network 
that is being targeted becomes clear, the source of the 
attack gets amplified and the relation between scans 
and subsequent attacks begins to emerge. The application
of thresholds to filter out the noise makes the
patterns even clearer. The significant effect is that we
have a much clearer view of the perpetrator, and a 
much deeper understanding of the target of the attack. 


We have carried out case studies on several oper- 
ational networks and verified the effectiveness of the 
approach. 


Case study 


We have carried out a case study by observing
the alerts generated on three networks. The first,
observation point 1, is a network connecting 10
computers; the second, observation point 2, is a network
connecting approximately 30 computers; the third,
observation point 3, connects a large-scale campus
network to the Internet.


We used Snort [3] to detect suspicious traffic. We 
used about 1200 signatures. The profiles of the potential 
attacks were observed for 183 days. As the meta-infor- 
mation we used the IP-address to AS-number mapping 
available from the routing registry [4], organization to 
network address mapping using the DNS system, and 
organization to country mapping using a locally com- 
piled database (with network sources as input). 


Evaluation 
Clarity From Aggregation of Profiles 


Three separate modes of aggregation are possible:
source based, destination based, and source-destination
based. We experimented with all three modes and
found source-based aggregation to be the least
effective. This is expected, as more often than not the source
addresses are spoofed. The effect of aggregation is
most pronounced when destination-based aggregation
is carried out.


The advantage of aggregation is twofold. First, it
can reduce the total amount of data to less than half.
Whether the analysis is done manually or
automatically, the data is much easier to handle
when its volume is considerably smaller. Second,
aggregation gives a much clearer view of the
attacker(s) and of the victim(s). Seen
individually, attacks may appear to come
from different places with no relation between
them, but with aggregation we can find out whether those
attacks originate from the same network entity.
Thus it is easier to locate a perpetrator. Similarly,
for destinations (the victims of an attack), it is
easier to identify the actual target. For example, in the case
of port scanning, the probed destinations may
appear unrelated, but with aggregation we can find
which destination the attacker is targeting. In
both cases, aggregation gives an amplified picture of
the source and destination of an attack.


We then compared the number of profiles generated
using the conventional and proposed methods.


We define a rate of aggregation as: 


$$\text{Aggreg. Rate} = 100\% \times \frac{\#\ \text{proposed model profiles}}{\#\ \text{conventional model profiles}}$$
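
As a quick check of this definition, the rate can be computed directly from profile counts. The function name below is our own, chosen for illustration; the counts in the usage note are those reported for observation points 1 and 2 in Table 1.

```python
def aggregation_rate(proposed_profiles: int, conventional_profiles: int) -> float:
    """Rate of aggregation in percent: proposed / conventional profiles."""
    if conventional_profiles <= 0:
        raise ValueError("conventional profile count must be positive")
    return 100.0 * proposed_profiles / conventional_profiles
```

For example, `aggregation_rate(4971, 10443)` is about 47.60 and `aggregation_rate(30030, 61459)` is about 48.86. Under this definition a lower rate means that aggregation removed more profiles.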


Table 1 shows the total numbers of profiles for the
183 days as seen in the data at the three observation
points. The data for the source-destination based
aggregation is given here. From this table it is clear that
aggregation reduces the amount of data to
less than half, indicating greater feasibility of analysis.
It also brings clarity. The aggregation
scheme enables one to detect types of attacks that
could not be detected otherwise. If source IP addresses
are aggregated, an attack from distributed sources in a
single network is more likely to be detected,
while if destination addresses are aggregated, an attack
that appears to be aimed at different, unrelated
destinations may be found to target multiple
destinations in the same network.

Figure 2: Distribution of the rate of aggregation of profile.

Figure 3: Change of the rate of aggregation during the period of observation.


Figure 2 shows the distribution of the rates of 
aggregation per day, where the horizontal axis is the 
rate of aggregation and the vertical axis is the fre- 
quency (the number of days). Figure 3 shows the 
change of the rate of aggregation during the period of 
observation. The horizontal axis is time (183 days) 
and the vertical axis is the rate of aggregation. 


Simple AS-based aggregation reduces the number
of profiles by approximately 50%. The
result at observation point 3 shows that such
aggregation is more effective when Internet traffic is involved.


We see that the aggregation varies almost
every day, as shown in Figures 2 and 3. There are days
with no aggregation, and days when the aggregation rate
is very high.








           Conventional   Proposed   Aggreg.
           Model          Model      Rate
Point 1    10443          4971       47.60%
Point 2    61459          30030      48.86%
Point 3    348707         11322]     32.46%

Table 1: Clarity from aggregation of profiles.





Another important observation is that the number
of rejected alerts when the thresholding technique is used
in conjunction with logical aggregation is much
smaller than when thresholding is used in isolation.
This is significant, as it implies that the results are
safer and more accurate.


In Figure 4 we show the effect of the threshold
for both IP-based and AS-based alerts. Here the
horizontal axis is the threshold value and the vertical
axis is the percentage of alerts that are
covered (after ignoring the alerts below the threshold
value). ‘SIP’ and ‘DIP’ mean ‘Source IP’ and
‘Destination IP’ respectively. Similarly, ‘SAS’ and
‘DAS’ mean ‘Source AS’ and ‘Destination AS.’


Obviously, when the threshold value is zero,
no alerts are rejected and there is no chance of any
‘false negative,’ because every single alert is taken
into account. Increasing the threshold value (and
hence the number of rejected alerts) affects
IP-based alerts much more than AS-based alerts. This
implies that AS-based grouping is more effective for
getting a clearer picture of the attacker and the
victim, as it covers a greater range. At the
same time, it indicates that in our approach the
chance of a ‘false negative’ is very low: even
after the threshold value is increased considerably,
the total number of rejected alerts is comparatively
lower than that of the conventional IP-based
approach. And although our model may not reduce
the number of ‘false positive’ alerts, it cannot increase
them relative to conventional models.
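
The coverage measure discussed above, the percentage of alerts that survive a given threshold, can be sketched as follows. The function name and the toy keys in the usage note are assumptions made for illustration; they are not taken from the case-study data.

```python
from collections import Counter
from typing import Hashable, Iterable


def coverage_percent(alert_keys: Iterable[Hashable], threshold: int) -> float:
    """Percentage of alerts belonging to groups at or above the threshold.

    Each element of alert_keys is the grouping key of one alert, for example
    a source IP address (IP-based) or a source AS label (AS-based).
    """
    counts = Counter(alert_keys)
    total = sum(counts.values())
    if total == 0:
        return 100.0  # no alerts at all, so none can be rejected
    kept = sum(count for count in counts.values() if count >= threshold)
    return 100.0 * kept / total
```

For four alerts from four distinct addresses, a threshold of 2 gives a coverage of 0%. If three of those addresses map to the same AS, the same threshold applied to the AS keys gives 75%, which mirrors the gap between the IP-based and AS-based curves described in the text.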


Conclusion 


The aggregation technique envisaged in the
model helps in providing much greater clarity to the
results. When used in conjunction with the thresholding
technique, its effect is very significant. Results
from the case studies show that it is possible to reduce
the number of entities that require close attention to a
manageably small set. We have also found that in our
approach the chance of rejecting actual intrusive alerts
(false negatives) is much lower; false negatives are
considered a far more serious problem than false
positives.

Figure 4: Percentage of number of alerts for both IP-based and AS-based profiles with different threshold values. (Horizontal axis: threshold, in number of alerts.)


These techniques coupled with network configu- 
ration information and network visualization tech- 
niques [5] are likely to have a significant impact on 
intrusion detection systems. 
ABSTRACT


Security vulnerabilities are discovered, become publicly known, get exploited by attackers, and 
patches come out. When should one apply security patches? Patch too soon, and you may suffer 
from instability induced by bugs in the patches. Patch too late, and you get hacked by attackers 
exploiting the vulnerability. We explore the factors affecting when it is best to apply security patches,
providing mathematical models of those factors and collecting empirical
data to give the models practical value. We conclude with a model that we hope will help provide a
formal foundation for when the practitioner should apply security updates.


Introduction 


“To patch, or not to patch, — that is the question: — 
Whether tis nobler in the mind to suffer 
The slings and arrows of outrageous script kiddies, 
Or to take up patches against a sea of troubles, 
And by opposing, end them?” [24] 


“When to patch?” presents a serious problem to 
the security administrator because there are powerful 
competing forces that pressure the administrator to 
apply patches as soon as possible and also to delay 
patching the system until there is assurance that the 
patch is not more likely to cause damage than it pro- 
poses to prevent. Patch too early, and one might be 
applying a broken patch that will actually cripple the 
system’s functionality. Patch too late, and one is at risk 
from penetration by an attacker exploiting a hole that is 
publicly known. Balancing these factors is problematic. 


The pressure to immediately patch grows with 
time after the patch is released, as more and more 
script kiddies acquire scanning and attack scripts to 
facilitate massive attacks [4]. Conversely, the pressure 
to be cautious and delay patching decreases with time, 
as more and more users across the Internet apply the 
patch, providing either evidence that the patch is 
defective, or (through lack of evidence to the contrary) 
that the patch is likely okay to apply. Since these 
trends go in opposite directions, it should be possible 
to choose a time to patch that is optimal with respect to 
the risk of compromising system availability. Figure 1 
conceptually illustrates this effect; where the lines cross 
is the optimal time to patch, because it minimizes the 
total risk of loss. 


This paper presents a proposed model for finding
the appropriate time to apply security patches. Our
approach is to model the cost (risk and consequences)
of penetration due to attack and of corruption due to a
defective patch, with respect to time, and then solve
for the intersection of these two functions.

†This work supported by DARPA Contract F30602-01-C-0172.


These costs are functions of more than just time. 
We attempt to empirically inform the cost of failure 
due to defective patches with a survey of security 
advisories. Informing the cost of security penetration 
due to failure to patch is considerably more difficult, 
because it depends heavily on many local factors. 
While we present a model for penetration costs, it is 
up to the local administrator to determine this cost. 





Figure 1: A hypothetical graph of risks of loss from
penetration and from application of a bad patch.
The optimal time to apply a patch is where the
risk lines cross. (Horizontal axis: Time; vertical axis:
Risk of Loss; curves: Bad Patch Risk, Penetration Risk.)





In particular, many security administrators feel 
that it is imperative to patch vulnerable systems imme- 
diately. This is just an end-point in our model, repre- 
senting those sites that have very high risk of penetra- 
tion and have ample resources to do local patch testing 
in aid of immediate deployment. Our intent in this 
study is to provide guidelines to those who do not 
have sufficient resources to immediately test and patch 
everything, and must choose where to allocate scarce 
security resources. We have used the empirical data to
arrive at concrete recommendations for when patches 
should be applied, with respect to the apparent com- 
mon cases in our sample data. 


It should also be noted that we are not considering 
the issue of when to disable a service due to a vulnera- 
bility. Our model considers only the question of when 
to patch services that the site must continue to offer. In 
our view, if one can afford to disable a service when 
there is a security update available, then one probably 
should not be running that service at all, or should be 
running it in a context where intrusion is not critical. 


Lastly, we do not believe that this work is the 
final say in the matter, but rather continues to open a 
new area for exploration, following on Browne, et al. 
[4]. As long as frequent patching consumes a significant
fraction of security resources, resource allocation
decisions will have to be made concerning how to deal 
with these patches. 


The rest of this paper is structured as follows. The 
next section presents motivations for the models we use 
to describe the factors that make patching urgent and 
that motivate caution. Then, the next section formally 
models these factors in mathematical terms and pre- 
sents equations that express the optimal time to apply 
patches. The subsequent section presents methods and 
issues for acquiring data to model patch failure rates. 
The paper then presents the empirical data we have col- 
lected from the Common Vulnerabilities and Exposures 
(CVE) database and describes work related to this 
study. The paper ends with a discussion of the implications
of this study for future work and our conclusions.


Problem: When To Patch 


The value of applying patches for known secu- 
rity issues is obvious. A security issue that will shortly 
be exploited by thousands of script-kiddies requires 
immediate attention, and security experts have long 
recommended patching all security problems. How- 
ever, applying patches is not free: it takes time and 
carries a set of risks. Those risks include that the patch 
will not have been properly tested, leading to loss of 
stability; that the patch will have unexpected interac- 
tion with local configurations, leading to loss of func- 
tionality; that the patch will not fix the security prob- 
lem at hand, wasting the system administrator’s time. 
Issues of loss of stability and unexpected interaction 
have a direct and measurable cost in terms of time 
spent to address them. To date, those issues have not 
been a focus of security research. There is a related 
issue: finding a list of patches is a slow and labor- 
intensive process [7]. While this makes timely appli- 
cation of patches less likely because of the investment 
of time in finding them, it does not directly interact 
with the risk that applying the patch will break things. 
However, the ease of finding and applying patches has 
begun to get substantial public attention [20] and is 
not our focus here. 
Most system administrators understand that these
risks are present, either from personal experience or 
from contact with colleagues. However, we know of no 
objective assessment of how serious or prevalent these 
flaws are. Without such an assessment it is hard to 
judge when (or even if) to apply a patch. Systems 
administrators have thus had a tendency to delay the 
application of patches because the costs of applying 
patches are obvious, well known, and have been hard to 
balance against the cost of not applying patches. Other 
sources of delay in the application of patches can be 
rigorous testing and roll-out procedures and regulations 
by organizations such as the US Food and Drug 
Administration that require known configurations of 
systems when certified for certain medical purposes [1]. 


Some organizations have strong processes for 
triaging, testing, and rolling-out patches. Others have 
mandatory policies for patching immediately on the 
release of a patch. Those processes are very useful to 
them, and less obviously, to others, when they report 
bugs in patches. The suggestions that we make regard- 
ing delay should not be taken as a recommendation to 
abandon those practices.¹


The practical delay is difficult to measure, but its 
existence can be inferred from the success of worms 
such as Code Red. This is illustrative of the issue cre- 
ated by delayed patching, which is that systems remain 
vulnerable to attack. Systems which remain vulnerable 
run a substantial risk of attacks against them succeed- 
ing. One research project found that systems containing 
months-old known vulnerabilities with available but 
unapplied patches exposed to the Internet have a “life 
expectancy” measured in days [14]. Once a break-in 
has occurred, it will need to be cleaned up. The cost of 
such clean-up can be enormous. 


Having demonstrated that all costs relating to 
patch application can be examined in the “currency” 
of system administrator time, we proceed to examine 
the relationship more precisely. 


Solution: Optimize the Time to Patch 


To determine the appropriate time to patch, we 
need to develop a mathematical model of the potential 
costs involved in patching and not patching at a given 
time. In this section we will develop cost functions 
that systems administrators can use to help determine 
an appropriate course of action. 


First, we define some terms that we will need to 
take into account: 

• $e_{patch}$ is the expense of fixing the problem
  (applying the patch), which is either an opportunity
  cost or the cost of additional staff.
• $e_{p.recover}$ is the expense of recovering from a
  failed patch, including opportunity cost of work
  delayed. Both this and the next cost may
  include a cost of lost business.
• $e_{breach}$ is the expense of recovering from a security
  breach, including opportunity cost of work
  delayed and the cost of forensics work.
• $p_{fail}$ is the likelihood that applying a given
  patch will cause a failure.
• $p_{breach}$ is the likelihood that not applying a
  given patch will result in a security breach.

¹As a perhaps amusing aside, if everyone were to follow
our suggested delay practice, it would become much less
effective. Fortunately, we have no expectation that everyone
will listen to us.


All of these costs and probabilities are parameterized.
The costs $e_{patch}$, $e_{p.recover}$, and $e_{breach}$ are all
particular to both the patch in question and the
configuration of the machine being patched. However,
because the factors affecting these costs are so specific
to an organization, we treat the costs as constants. They
are constant within an organization, not between
organizations, which we believe is sensible for a given
systems administrator making a decision.


The probabilities $p_{fail}$ and $p_{breach}$ vary with time.
Whether a patch is bad or not is actually a fixed fact at 
the time the patch is issued, but that fact only becomes 
known as the Internet community gains experience 
applying and using the patch. So as a patch ages with- 
out issues arising, the probability of a patch turning 
out to be bad decreases. 


The probability $p_{breach}$ is a true probability that
increases with time in the near term. Browne, et al. [4]
examined exploitation rates of vulnerabilities and
identified factors, such as the release of a
scripted attack tool, that influence rates of breaches. However, the
rate of breach is not a simple function $1/|InternetHosts|$
or even $N/|InternetHosts|$ (where $N$ is the number of
hosts or unprotected hosts that a systems administrator
is responsible for and $|InternetHosts|$ is the number of
hosts on the Internet). Not every host with a vulnera- 
bility will be attacked, although in the wake of real 
world events such as the spread of Code Red [6] and 
its variants, as well as work on Flash [26] and Warhol 
[28] worms, it seems that it may be fair to make that 
assumption. 


Thus we will consider both probabilities $p_{fail}$ and
$p_{breach}$ as functions of time ($t$), and write them $p_{fail}(t)$
and $p_{breach}(t)$.


Next, we want to develop two cost functions:

* $c_{patch}(t)$: cost of patching at a given time $t$.
* $c_{nopatch}(t)$: cost of not patching at a given time $t$.


The probable cost of patching a system drops
over time as the Internet community gains confidence
in the patch through experience. Conversely, the probable
cost of not patching follows a ‘ballistic’ trajectory,
as the vulnerability becomes more widely known,
exploitation tools become available, and then fall out
of fashion [4]; but, for the part of the ballistic curve
we are concerned with, we can just consider cost of
not patching to be monotonically rising. Therefore, the
administrator will want to patch vulnerable systems at
the earliest point in time where $c_{patch}(t) \le c_{nopatch}(t)$.


The cost of patching a system will have two 
terms: the expense of applying the patch, and the 
expense of recovering from a failed patch. Applying a 
patch will likely have a fixed cost that must be paid 
regardless of the quality of the patch. Recovery cost, 
however, will only exist if a given patch is bad, so we 
need to consider the expected risk in a patch. Since a 
systems administrator cannot easily know a priori 
whether a patch is bad or not, we multiply the proba- 
bility that the patch induces failure by the expected 
recovery expense. This gives us the function 

$$c_{patch}(t) = p_{fail}(t)\, e_{p.recover} + e_{patch} \qquad (1)$$

It is possible, although not inexpensive, to obtain 
much better estimations of the probability of failure 
through the use of various testing mechanisms, such as 
having a non-production mirror of the system, patch- 
ing it, and running a set of tests to verify functionality. 
However, such systems are not the focus of our work. 


The cost of not applying a patch we consider to 
be the expense of recovering from a security breach. 
Again, an administrator is not going to know a priori 
that a breach will occur, so we consider the cost of 
recovery in terms of the probability of a security 
breach occurring. Thus we have: 

$$c_{nopatch}(t) = p_{breach}(t)\, e_{breach} \qquad (2)$$

Pulling both functions together, a systems 
administrator will want to patch vulnerable systems 
when the following is true: 

$$p_{fail}(t)\, e_{p.recover} + e_{patch} \le p_{breach}(t)\, e_{breach} \qquad (3)$$

In attempting to apply the functions derived 
above, a systems administrator may want to take more 
precise estimates of various terms. 
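

As a minimal sketch, equations 1 through 3 can be written out as code. The function names are ours for illustration, and every cost and probability in the usage lines is a hypothetical value chosen only to show both outcomes of the test; none of them comes from our data.

```python
def cost_patch(p_fail, e_p_recover, e_patch):
    """Equation 1: expected cost of patching at a given time."""
    return p_fail * e_p_recover + e_patch


def cost_nopatch(p_breach, e_breach):
    """Equation 2: expected cost of not patching at a given time."""
    return p_breach * e_breach


def should_patch(p_fail, p_breach, e_patch, e_p_recover, e_breach):
    """Equation 3: patch once patching costs no more than waiting."""
    return cost_patch(p_fail, e_p_recover, e_patch) <= cost_nopatch(p_breach, e_breach)


# Hypothetical site: all numbers below are made up for illustration.
# Shortly after release: the patch is still risky, attacks are still rare.
early = should_patch(p_fail=0.15, p_breach=0.01,
                     e_patch=100.0, e_p_recover=5000.0, e_breach=20000.0)

# Some weeks later: the patch has aged well, attacks are more likely.
later = should_patch(p_fail=0.05, p_breach=0.05,
                     e_patch=100.0, e_p_recover=5000.0, e_breach=20000.0)
```

With these made-up inputs the test fails at the early point and succeeds at the later one, which is the behavior the model is meant to capture.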


On the Cost Functions 


Expenses for recovering from bad patches and 
security breaches are obviously site and incident spe- 
cific, and we have simplified some of that out to ease 
our initial analysis and aid in its understanding. 


We could argue with some confidence that the
cost of penetration recovery often approximates the cost
of bad patch recovery. In many instances, it probably
amounts to “reinstall.” This simplifying assumption
may or may not be satisfactory. Recovery from a break-in
is likely harder than recovery from a bad patch,
because recovery from a bad patch may simply be a reinstall,
or at least does not involve the cost of dealing
with malice, while recovery from getting hacked involves
identifying and saving critical state with tweezers,
re-formatting, re-installation, applying patches, recovering
state from backup, patching some more, ensuring that
the recovered state carries no security risk, and performing
forensics, a non-trivial expense [10]. However,
it is possible that recovery from a bad patch could have
a higher cost than penetration recovery: consider a
patch that introduces subtle file system corruption that
is not detected for a year.


Furthermore, we note that many vendors are
working to make the application of security patches as 
simple as possible, thereby reducing the expense of 
applying a security patch [22, 25, 29]. As the fixed 
cost of applying a patch approaches zero, we can sim- 
ply remove it from the equations: 

$$p_{fail}(t)\, e_{p.recover} \le p_{breach}(t)\, e_{breach} \qquad (4)$$

Alternately, we can assume that recovery from
being hacked is $C$ times harder than recovery from a bad
patch ($C$ may be less than one). While the math is still
fairly simple, we are not aware of systematic research
into the cost of recovering from security break-ins.
However, a precise formulation of the time is less 
important to this paper than the idea that the time 
absorbed by script kiddies can be evaluated as a func- 
tion of system administrator time. Expenses incurred 
in recovery are going to be related to installation size 
and number of affected machines, so an argument that 
there is some relationship between costs can be made. 
This allows us to state that: 

$$e_{breach} = C\, e_{p.recover} \qquad (5)$$

We can substitute this into equation 4: 

$$p_{fail}(t)\, e_{p.recover} \le p_{breach}(t)\, C\, e_{p.recover} \qquad (6)$$

Dividing each side by $e_{p.recover}$, we arrive at the
decision algorithm:

$$p_{fail}(t) \le p_{breach}(t)\, C \qquad (7)$$

Recall our assumptions that $p_{breach}(t)$ rises with
time and $p_{fail}(t)$ drops with time. Therefore, the earliest
time $t$ at which equation 7 is satisfied is the optimal time to
apply the patch.
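

The search for that earliest time can be sketched as a simple day-by-day scan. This is only an illustration under stated assumptions: the function name, the one-day granularity, the one-year search limit, the value of C, and both probability curves are hypothetical and would have to be replaced by locally measured values.

```python
import math


def earliest_patch_day(p_fail, p_breach, c, max_days=365):
    """Return the first whole day t with p_fail(t) <= c * p_breach(t), or None."""
    for day in range(max_days + 1):
        if p_fail(day) <= c * p_breach(day):
            return day
    return None


# Hypothetical curves (assumptions, not measured data): p_fail decays
# exponentially from 0.18, and p_breach rises linearly, capped at 1.
def p_fail(t):
    return 0.18 * math.exp(-t / 20.0)


def p_breach(t):
    return min(1.0, 0.002 * t)


# C = 2.0 is also an assumption: breach recovery twice as costly as
# recovery from a bad patch.
day = earliest_patch_day(p_fail, p_breach, c=2.0)
```

The scan does not itself rely on the curves being monotonic; monotonicity is what justifies treating the first crossing as the optimal time rather than merely one acceptable time.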


When to Start the Clock 


While we discuss the value of the equations above 
at given times, there are actually numerous points from 
which time can be counted. There is the time from the 
discovery of a vulnerability, time from the public 
announcement of that vulnerability, and time since a 
patch has been released. Browne, et al. [4] work from 
the second, since the first may be unknown, but the 
spread of the vulnerability information may be better 
modeled from the first, especially if the vulnerability is 
discovered by a black hat. A systems administrator may 
only care from the time a patch is available, although 
some may choose to shut off services known to be vul- 
nerable before that as a last resort, and work has been 
done on using tools such as chroot(2), Janus [12], and 
SubDomain [9, 13] to protect services that are under 
attack. In this paper, we have chosen to start counting 
time from when the patch is released. 


Methodology 


The first thing to consider when deciding to 
experimentally test the equations derived previously is 
a source of data. We considered starting with specific 
vendors’ advisories. Starting from vendor data has 
flaws: it is difficult to be sure that a vendor has pro- 
duced advisories for all vulnerabilities, the advisories 
may not link to other information in useful ways, and 
different vendors provide very different levels of
information in their advisories. 


We decided instead to work from the Common 
Vulnerabilities and Exposures (CVE) [16], a MITRE- 
hosted project, to provide common naming and con- 
cordance among vulnerabilities. Since MITRE is an 
organization independent of vendors, using the CVE 
database reduces the chance of bias. Starting from 
CVE allows us to create generic numbers, which are 
useful because many vendors do not have a sufficient 
history of security fixes. However, there are also many 
vendors who do have such a history and sufficient pro- 
cess (or claims thereof) that it would be possible to 
examine their patches, and come up with numbers that 
apply specifically to them. 


Data Gathering 


Starting from the latest CVE (version 20020625), 
we split the entries into digestible chunks. Each section 
was assigned to a person who examined each of the ref- 
erences. Some of the references were unavailable, in 
which case they were ignored or tracked down using a 
search engine. They were ignored if the issue was one 
with many references (e.g., CVE-2001-0414 has 22 ref- 
erences, and the two referring to SCO are not easily 
found.) If there was no apparent patch re-issue, we 
noted that. If there was, we noted how long it was until 
the patch was withdrawn and re-released. Some advi- 
sories did not make clear when or if a bad patch was 
withdrawn, and in that case, we treated it as if it was 
withdrawn by replacement on the day of re-issuance. 


Methodological Issues 


“There are more things in Heaven and Earth, 
Horatio, 
Then are dreamt of in our Philosophy.” [24] 


Research into vulnerabilities has an unfortunate 
tendency to confound researchers with a plethora of 
data gathering issues. These issues will impact the 
assessment of how likely a patch is to fail. It is impor- 
tant to choose a method and follow it consistently for 
the results to have any meaning; unfortunately, any 
method chosen causes us to encounter issues which 
are difficult and troubling to resolve. Once we select a 
method and follow it, our estimates may be systemati- 
cally wrong for several reasons. Cardinality issues are 
among the worst offenders: 

* Vendors rolling several issues into one patch:
An example of this is found in one vendor
patch [19] which is referenced by seven candidates
and entries in the CVE (CAN-2001-0349,
CAN-2001-0350, CVE-2001-0345, CVE-2001-0346,
CVE-2001-0348, CVE-2001-0351, and
CVE-2001-0347).

* Vendors rolling one patch into several advisories:
An example here is CVE-2001-0414
with a dozen vendors involved. The vulnerability
is not independent because Linux and BSD
vendors commonly share fixes, and so an
update from multiple vendors may be the same
patch. If this patch is bad, then in producing a
generic recommendation of when to patch, we
could choose to count it as N bad patches,
which would lead to a higher value for $p_{fail}$, and
consequently, later patching. If the single patch
is good, then counting it as N good patches
could bias the probability of patch failure
downward.

* Vendors releasing advisories with workarounds,
but no patches: An example is
CVE-2001-0221, where the FreeBSD team
issued the statement “[this program] is scheduled
for removal from the ports system if it has
not been audited and fixed within one month of
discovery.” No one fixed it, so no patch was
released. A related situation occurred when a
third party, unrelated to Oracle, released an
advisory relating to Oracle’s product along with
a workaround, and Oracle remained completely
silent about the issue (CVE-2001-0326). We
recorded these instances but treated them as
non-events; our goal is to measure quality of
patches, and if no patch was released, there is
nothing to measure.
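

To make the second cardinality issue concrete, the sketch below shows how counting one shared patch N times would move an estimated failure rate. The counts of 20 faulty and 92 unrevised initial patches are those reported under Empirical Data; N = 12 merely echoes the dozen vendors mentioned above, and the function name is hypothetical. This illustrates the direction of the bias and is not a correction we applied.

```python
from fractions import Fraction


def failure_rate(bad, good):
    """Fraction of released patches that were later found to be bad."""
    return Fraction(bad, bad + good)


# Counts from the data set: 20 faulty and 92 unrevised initial patches.
baseline = failure_rate(20, 92)

# Hypothetical: one patch shared by N = 12 vendors is counted N times
# instead of once, adding N - 1 extra entries to one side of the tally.
n = 12
if_shared_patch_bad = failure_rate(20 + (n - 1), 92)
if_shared_patch_good = failure_rate(20, 92 + (n - 1))
```

Exact fractions are used so that the comparison between the three estimates is not blurred by rounding.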


There are several other potential sources of bias. 
We may not have accurate information on whether a 
vendor released an updated patch, because the CVE 
entry points to the original, and the vendor released a 
subsequent/different advisory. This potentially intro- 
duces a bias by reducing our computed probability of 
a harmful patch. 


When patches are not independent, there is bias 
in a different direction; consider if one or more ven- 
dors released a revised update while others did not (for 
example, CVE-2001-0318). We considered each CVE 
entry as one patch, even if it involved multiple ven- 
dors. We chose to record the data for the vendor who 
issued the latest advisory revision (e.g., Debian over 
Mandrake and Conectiva in CVE-2001-0318). This 
potentially introduces a bias towards patches being 
less reliable than they actually are. Systems adminis- 
trators tracking the advisories of one specific vendor 
would not have this potential source of bias. 


It may be difficult to decide if a patch is bad or 
not. For example, the Microsoft patch for 
CVE-2001-0016 was updated six months after its 
release. There was a conflict between this patch and 
Service Pack 2 for Windows 2000. Installing the patch 
would disable many of the updates in Service Pack 2. 
Note that SP2 was issued four months after the patch,
so there were four months where the patch was harmless,
and two months where the patch and Service
Pack 2 conflicted. We treated it as if it was bad for the
entire six months.


There is a potential for concern with the number 
of CVE entries we have examined. In the next section, 
we attempt to infer appropriate times to apply patches 
by observing the knees in the curves shown in Figures
7 and 9, and these inferences would be stronger if 
there were sufficient data points to be confident that 
the knees were not artifacts of our sample data. 


More data points would be desirable, but obtaining
them is problematic. We found the CVE repository to
be limiting, in that it was difficult to determine
whether any given security patch was defective. For
future research, we recommend using security advisory
information directly from vendors. In addition to
providing more detail, such an approach would help
facilitate computing patch failure rate with respect to
each vendor.


We do not believe that these issues prevent a 
researcher from analyzing the best time to patch, or a 
systems administrator from making intelligent choices 
about when to patch. However, these methodological 
issues do need to be considered in further studies of 
security patch quality. 


Empirical Data 


In this section, we examine the data we collected
as discussed in the previous section. We examined 136
CVE entries, dating from 1999, 2000, and 2001. Of
these, 92 patches were never revised, leading us to
believe they were safe to apply; 20 patches were either
updated or pulled; and 24 CVE entries were non-patch
events as discussed in ‘Methodological Issues.’ Table
2 summarizes this data. Of the 20 patches that were
determined to be faulty, all but one (CVE-2001-0341)
had an updated patch released. Of these, three were
found to be faulty and had a second update released;
one subsequently had a third revision released. Table 3
summarizes the data for the revised patches.


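CVE entries examined                   136
Patches never revised                   92
Revised or pulled patches               20
Non-patch entries                       24

Table 2: Quality of initial patches.


Revised or pulled patches               20
Good revised patches                    16
Re-revised patches                       3
Pulled and never re-released patches     1

Table 3: Quality of revised patches.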





Table 4 analyzes the properties of the patch revisions.
The middle column shows the number of days
from the initial patch release until an announcement of
some kind appeared indicating that the patch was bad,
for the 20 patches that were revised. The right column
shows the number of days from the revised patch
release until notification that the revised patch was
faulty, for the three issues that had subsequent revisions.
Three data points are insufficient to draw
meaningful conclusions, so we will disregard doubly
or more revised patches from here on. We found one
triply revised patch, occurring seven days after the
release of the second revision.


Figure 5: A histogram of the number of faulty initial patches.

Figure 6: The probability $p_{fail}(t)$ that an initial patch has been incorrectly identified as safe to apply.

Figure 7: A cumulative graph showing the time to resolve all issues.


Notification        Initial revision    Subsequent revision
time in days        (20 data points)    (3 data points)

Maximum                   500                  62
Minimum                     1                   1
Average                  64.2                22.7
Median                   17.5                   5
Std deviation           117.0                34.1

Table 4: Analysis of revision data.





Figure 5 presents a histogram over the 20 revised
patches of the time from the initial patch release to the
time of the announcement of a problem with the patch,
while Figure 8 examines the first 30-day period in
detail. Figure 6 presents the same data set as a probability
at a given time since initial patch release that a
patch will be found to be bad, i.e., an empirical plot of
$p_{fail}(t)$ from equation 7.
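

One way such an empirical curve could be computed is sketched below. The estimator is an assumption made for this illustration and is not necessarily the computation behind Figure 6: at day t, the patches still believed safe are the good patches plus the bad patches whose problems have not yet been announced, and the estimate is the share of the latter. The function name, the notification days, and the count of good patches are all hypothetical placeholders, not our measurements.

```python
def empirical_p_fail(t, notification_days, n_good):
    """Share of patches still believed safe at day t that are later reported bad."""
    still_hidden = sum(1 for d in notification_days if d > t)
    believed_safe = n_good + still_hidden
    return still_hidden / believed_safe if believed_safe else 0.0


# Hypothetical placeholders: days from release until each faulty patch
# was announced as bad, and the number of patches never found faulty.
days = [2, 4, 6, 9, 15, 28, 40, 75]
N_GOOD = 40

# Sample the estimate at release and at two candidate waiting periods.
curve = {t: empirical_p_fail(t, days, N_GOOD) for t in (0, 10, 30)}
```

Under this estimator the value can only fall or stay level as t grows, because each announcement removes a faulty patch from the pool of patches still believed safe.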


Figure 7 plots the days to resolve an accumulated 
number of security issues, while Figure 9 examines 
the first 30-day period more closely. These plots are 
subtly different from the previous data sets in two 
ways: 

* Time to resolution: In the previous graphs, we
counted time from when the security patch was
announced to the time the patch was announced
to be defective. Here, we are measuring to the
time the defective patch is resolved. Of the 20
revised patches, 16 provided a revised patch
concomitant with the announcement of the
patch problem, two had a one-day delay to the
release of a revised patch, one had a 97-day
delay to the release of a revised patch, and one
defective patch was never resolved.

* No patch ever released: Of the 136 CVE
entries that we surveyed, 24 never had any
patch associated with them, and so for these
plots, will never be resolved.


Ideally, we would like to be able to overlay Figure 6
with a similar probability plot for “probability of
getting hacked at time $t$ past initial disclosure,” or
$p_{breach}(t)$. Unfortunately, it is problematic to extract
such a probability from Browne, et al.’s data [4] 
because the numerator (attack incidents) is missing 
many data points (people who did not bother to report 
an incident to CERT), and the denominator is huge 
(the set of all vulnerable nodes on the Internet). 


From Honeynet [14] one may extract a $p_{breach}(t)$.
Honeynet sought to investigate attacker behavior by
placing “honeypot” (deliberately vulnerable) systems
on the Internet, and observing the subsequent results.
In particular, Honeynet noted that the lifespan of an
older, unpatched Red Hat Linux system containing
months-old known vulnerabilities could be as short as
a small number of days, as attacker vulnerability scanning
tools quickly located and exploited the vulnerable
machine. However, we note that this probability may
not correlate to the system administrator’s site. Site
specific factors (site popularity, attention to security
updates, vulnerability, etc.) affect the local $p_{breach}(t)$,
and as such it must be measured locally.


Figure 8: A close-up histogram of the first class interval
in Figure 5. It shows the number of faulty initial
patch notifications occurring within 30 days
of initial patch release.


[Figure 9 plot: patches resolved (vertical axis, 90 to 106) versus days from initial patch release (horizontal axis, 0 to 30).]

Figure 9: A close-up cumulative graph of the first 30
days in Figure 7. It shows the issue resolution
time for those occurring within 30 days of initial
patch release.


After determining local probability of a breach
(i.e., $p_{breach}(t)$), the administrator should apply Figure
6 to equation 7 to determine the first time $t$ where
equation 7 is satisfied. However, since $p_{breach}(t)$ is difficult
to compute, the pragmatist may want to observe
the knees in the curve depicted in Figures 7 and 9 and
apply patches at either ten or thirty days.


Related Work 


This paper was inspired by the “Code Red” and 
“Nimda” worms, which were so virulent that some 
analysts conjectured that the security administrators of
the Internet could not patch systems fast enough to stop 
them [20]. Even more virulent worm systems have been 
devised [28, 26] so the problems of “when to patch?” 
and “can we patch fast enough?” are very real. 


The recent survey of rates of exploitation [4] was 
critical to our work. In seeking to optimize the trade- 
off between urgent and cautious patching, it is impor- 
tant to understand both forms of pressure, and 
Browne, et al. provided the critical baseline of the 
time-sensitive need to patch. 


Schneier [23] also studied rates of exploitation 
versus time of disclosure of security vulnerabilities. 
However, Schneier conjectured that the release of a 
vendor patch would peak the rate of exploitation. The 
subsequent study by Browne, et al. of CERT incident 
data above belied this conjecture, showing that 
exploitation peaks long after the update is released, 
demonstrating that most site administrators do not 
apply patches quickly. 


Reavis [21] studied the timeliness of vendor-supplied
patches. Reavis computed the average “days of
recess” (days when a vulnerability is known, but no
patch is available) for each of Microsoft, Red Hat
Linux, and Solaris. Our clock of “when to patch?”
starts when Reavis’ clock of “patch available” stops.


Howard [15] studied Internet security incident 
rates from 1989 to 1995. He found that, with respect 
to the size of the Internet, denial-of-service attacks 
were increasing, while other attacks were decreasing. 
The cause of these trends is difficult to establish with- 
out speculation, but it seems plausible that the expo- 
nential growth rate of the Internet exceeded the 
growth rate of attackers knowledgeable enough to per- 
petrate all but the easiest (DoS) attacks. 


In 1996, Farmer [11] surveyed prominent web 
hosting sites and found that nearly two-thirds of such 
sites had significant vulnerabilities, well above the one- 
third average of randomly selected sites. Again, root 
causes involve speculation, but it is likely that this 
resulted from the complex active content that prominent 
web sites employ versus randomly selected sites. It is 
also likely that this trend has changed, as e-commerce 
sites experienced the pressures of security attacks. 


In recent work, Anderson [2] presents the view- 
point that many security problems become simpler 
when viewed through an economic lens. In this paper, 
we suggest that the system administrator’s failure to 
patch promptly is actually not a failure, but a rational 
choice. By analyzing that choice, we are able to sug- 
gest a modification to that behavior which addresses 
the concerns of the party, rather than simply exhorting 
administrators to patch. 


Also worth mentioning is the ongoing study of 
perception of risk. In McNeil, et al. [17], the authors 
point out that people told that a medical treatment has 
a 10% risk of death react quite differently than people 
told that 90% of patients survive. It is possible that
similar framing issues may influence administrators’
behavior with respect to security patches.


Discussion 


As we performed this study, we encountered sev- 
eral practical issues. Some were practical impediments 
to the execution of the study, while others were of 
larger concern to the community of vendors and users. 
Addressing these issues will both make future research 
in this area more consistent and valid, and also may 
improve the situation of the security practitioner. 


The first issue is that of setting the values for the 
constants in our equations, e.g., the cost of breach 
recovery versus the cost of bad patch recovery, and the 
probability of a breach for a given site. These values 
are site-specific, so we cannot ascertain them with any 
validity: 

* A web server that is just a juke box farm of
CD-ROMs is not as susceptible to data corruption
as an on-line gambling house or a credit
bureau, affecting the relative costs of recovery.

* A private corporate server behind a firewall is
less likely to be attacked than a public web server
hosting a controversial political advocacy page.


We wish to comment that the administrator’s
quandary is made worse by vendors who do a poor job
of quality assurance on their patches, validating the
systems administrator’s decision to not patch. A vendor
can easily take our ideas as advice on
how to improve their patch production process and
improve their customers’ security. If the standard deviation
of patch failure times is high, then administrators
will rationally wait to patch, leaving themselves insecure.
Extra work in assurance may pay great dividends.
In future work, it would be interesting to examine
vendor advance notice (where vendors are notified
of security issues ahead of the public) and observe
whether subsequent patches are more
reliable, i.e., whether vendors make good use of the
additional time.


In collecting data, we noticed but have not yet 
analyzed a number of trends: Cisco patches failed 
rarely, while other vendors often cryptically updated 
their advisories months after issue. The quality of 
advisories varies widely. We feel it is worth giving 
kudos to Caldera for the ease with which one can 
determine that they have issued a new patch [5]. How- 
ever, they could learn a great deal from some Cisco 
advisories [8] in keeping detailed advisory revision 
histories. Red Hat’s advisories included an “issue
date,” which we later discovered is actually the first
date that they were notified of the issue, not the date 
they issued the advisory. There has not, to our knowl- 
edge, been a paper on “how to write an advisory,” or 
on the various ways advisories are used. 


If one assumes that all problems addressed in 
this paper relating to buggy patches have been solved, 
the administrator must still reliably ascertain the valid-
ity of an alleged patch. Various forms of cryptographic 
authentication, such as PGP signatures on Linux RPM 
packages [3] and digital signatures directly on binary 
executables [27] can be used. Such methods become 
essential if one employs automatic patching mecha- 
nisms, as proposed by Browne, et al. [4] and provided 
by services such as Debian apt-get [25], Ximian Red 
Carpet [29], the Red Hat Network [22], and the automatic
update feature in Microsoft Windows XP [18].²


Conclusions 


“Never do today what you can put off till tomor- 
row if tomorrow might improve the odds.” 
— Robert Heinlein 


The diligent systems administrator faces a 
quandary: to rush to apply patches of unknown quality 
to critical systems, and risk resulting failure due to 
defects in the patch? Or to delay applying the patch, 
and risk compromise due to attack of a now well- 
known vulnerability? We have presented models for 
the pressures to patch early and to patch later, formally 
modeled these pressures mathematically, and popu- 
lated the model with empirical data of failures in secu- 
rity patches and rates of exploitation of known flaws. 
Using these models and data, we have presented a 
notion of an optimal time to apply security updates. 
We observe that the risk of patches being defective 
with respect to time has two knees in the curve at 10 
days and 30 days after the patch’s release, making 10 
days and 30 days ideal times to apply patches. It is our
hope that this model and data will both help to inspire
follow-on work and form a best practice for diligent
administrators to follow.